bzrformats_3.4.0.orig/.mailmap

Jelmer Vernooij Jelmer Vernooij Jelmer Vernooij INADA Naoki Martin Packman

bzrformats_3.4.0.orig/.testr.conf

[DEFAULT]
test_command=PYTHONPATH=`pwd`:$PYTHONPATH BRZ_PLUGIN_PATH=-site:-user python3 -m subunit.run bzrformats.tests.test_suite $IDOPTION $LISTOPT
test_id_option=--load-list $IDFILE
test_list_option=--list

bzrformats_3.4.0.orig/CODE_OF_CONDUCT.md

# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at core@breezy-vcs.org.
All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq

bzrformats_3.4.0.orig/COPYING.txt

GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Lesser General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price.
Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0.
This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

    a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

    b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

    c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3.
You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

    b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License.
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9.
The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker.

    <signature of Ty Coon>, 1 April 1989
    Ty Coon, President of Vice

This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License.

bzrformats_3.4.0.orig/Cargo.lock

# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4 [[package]] name = "addr2line" version = "0.25.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1b5d307320b3181d6d7954e663bd7c774a838b8220fe0593c86d9fb09f498b4b" dependencies = [ "gimli", ] [[package]] name = "adler2" version = "2.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa" [[package]] name = "aho-corasick" version = "1.1.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ddd31a130427c27518df266943a5308ed92d4b226cc639f5a8f1002816174301" dependencies = [ "memchr", ] [[package]] name = "allocator-api2" version = "0.2.21" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "683d7910e743518b0e34f1186f92494becacb047c7b6bf616c96772180fef923" [[package]] name = "android_system_properties" version = "0.1.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "819e7219dbd41043ac279b19830f2efc897156490d7fd6ea916720117ee66311" dependencies = [ "libc", ] [[package]] name = "anyhow" version = "1.0.102" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c" [[package]] name = "autocfg" version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" [[package]] name = "backtrace" version = "0.3.76" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bb531853791a215d7c62a30daf0dde835f381ab5de4589cfe7c649d2cbe92bd6" dependencies = [ "addr2line", "cfg-if", "libc", "miniz_oxide", "object", "rustc-demangle", "windows-link", ] [[package]] name = "base64" version = "0.22.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" [[package]] name = "bazaar" version = "3.4.0" 
dependencies = [ "base64", "bendy", "byteorder", "chrono", "crc32fast", "fancy-regex", "flate2", "lazy-regex", "lazy_static", "log", "lru", "maplit", "nix", "osutils", "pyo3", "regex", "sha1", "tempfile", "xmltree", "xz2", ] [[package]] name = "bazaar-py" version = "3.4.0" dependencies = [ "bazaar", "chrono", "osutils", "pyo3", "pyo3-filelike", ] [[package]] name = "bendy" version = "0.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8133e404c8bec821e531f347dab1247bf64f60882826e7228f8ffeb33a35a658" dependencies = [ "failure", ] [[package]] name = "bit-set" version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "08807e080ed7f9d5433fa9b275196cfc35414f66a0c79d864dc51a0d825231a3" dependencies = [ "bit-vec", ] [[package]] name = "bit-vec" version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5e764a1d40d510daf35e07be9eb06e75770908c27d411ee6c92109c9840eaaf7" [[package]] name = "bitflags" version = "2.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "843867be96c8daad0d758b57df9392b6d8d271134fce549de6ce169ff98a92af" [[package]] name = "block-buffer" version = "0.10.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3078c7629b62d3f0439517fa394996acacc5cbc91c5a20d8c658e77abd503a71" dependencies = [ "generic-array", ] [[package]] name = "bumpalo" version = "3.20.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5d20789868f4b01b2f2caec9f5c4e0213b41e3e5702a50157d699ae31ced2fcb" [[package]] name = "byteorder" version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" [[package]] name = "bzrformats-osutils" version = "0.1.0" dependencies = [ "osutils", "pyo3", "walkdir", ] [[package]] name = "cc" version = "1.2.58" source = "registry+https://github.com/rust-lang/crates.io-index" 
checksum = "e1e928d4b69e3077709075a938a05ffbedfa53a84c8f766efbf8220bb1ff60e1" dependencies = [ "find-msvc-tools", "shlex", ] [[package]] name = "cfg-if" version = "1.0.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" [[package]] name = "cfg_aliases" version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724" [[package]] name = "chrono" version = "0.4.44" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c673075a2e0e5f4a1dde27ce9dee1ea4558c7ffe648f576438a20ca1d2acc4b0" dependencies = [ "iana-time-zone", "js-sys", "num-traits", "wasm-bindgen", "windows-link", ] [[package]] name = "core-foundation-sys" version = "0.8.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b" [[package]] name = "cpufeatures" version = "0.2.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "59ed5838eebb26a2bb2e58f6d5b5316989ae9d08bab10e0e6d103e656d1b0280" dependencies = [ "libc", ] [[package]] name = "crc32fast" version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511" dependencies = [ "cfg-if", ] [[package]] name = "crypto-common" version = "0.1.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a" dependencies = [ "generic-array", "typenum", ] [[package]] name = "digest" version = "0.10.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" dependencies = [ "block-buffer", "crypto-common", ] [[package]] name = "equivalent" version = "1.0.2" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" [[package]] name = "errno" version = "0.3.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ "libc", "windows-sys", ] [[package]] name = "failure" version = "0.1.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d32e9bd16cc02eae7db7ef620b392808b89f6a5e16bb3497d159c6b92a0f4f86" dependencies = [ "backtrace", "failure_derive", ] [[package]] name = "failure_derive" version = "0.1.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "aa4da3c766cd7a0db8242e326e9e4e081edd567072893ed320008189715366a4" dependencies = [ "proc-macro2", "quote", "syn 1.0.109", "synstructure", ] [[package]] name = "fancy-regex" version = "0.17.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "72cf461f865c862bb7dc573f643dd6a2b6842f7c30b07882b56bd148cc2761b8" dependencies = [ "bit-set", "regex-automata", "regex-syntax", ] [[package]] name = "fastrand" version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be" [[package]] name = "find-msvc-tools" version = "0.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5baebc0774151f905a1a2cc41989300b1e6fbb29aff0ceffa1064fdd3088d582" [[package]] name = "flate2" version = "1.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "843fba2746e448b37e26a819579957415c8cef339bf08564fe8b7ddbd959573c" dependencies = [ "crc32fast", "miniz_oxide", ] [[package]] name = "foldhash" version = "0.1.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2" [[package]] name = "generic-array" version = 
"0.14.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a" dependencies = [ "typenum", "version_check", ] [[package]] name = "getrandom" version = "0.3.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd" dependencies = [ "cfg-if", "libc", "r-efi 5.3.0", "wasip2", ] [[package]] name = "getrandom" version = "0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0de51e6874e94e7bf76d726fc5d13ba782deca734ff60d5bb2fb2607c7406555" dependencies = [ "cfg-if", "libc", "r-efi 6.0.0", "wasip2", "wasip3", ] [[package]] name = "gimli" version = "0.32.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e629b9b98ef3dd8afe6ca2bd0f89306cec16d43d907889945bc5d6687f2f13c7" [[package]] name = "hashbrown" version = "0.15.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1" dependencies = [ "allocator-api2", "equivalent", "foldhash", ] [[package]] name = "hashbrown" version = "0.16.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100" [[package]] name = "heck" version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" [[package]] name = "iana-time-zone" version = "0.1.65" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e31bc9ad994ba00e440a8aa5c9ef0ec67d5cb5e5cb0cc7f8b744a35b389cc470" dependencies = [ "android_system_properties", "core-foundation-sys", "iana-time-zone-haiku", "js-sys", "log", "wasm-bindgen", "windows-core", ] [[package]] name = "iana-time-zone-haiku" version = "0.1.2" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "f31827a206f56af32e590ba56d5d2d085f558508192593743f16b2306495269f" dependencies = [ "cc", ] [[package]] name = "id-arena" version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3d3067d79b975e8844ca9eb072e16b31c3c1c36928edf9c6789548c524d0d954" [[package]] name = "indexmap" version = "2.13.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7714e70437a7dc3ac8eb7e6f8df75fd8eb422675fc7678aff7364301092b1017" dependencies = [ "equivalent", "hashbrown 0.16.1", "serde", "serde_core", ] [[package]] name = "indoc" version = "2.0.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "79cf5c93f93228cf8efb3ba362535fb11199ac548a09ce117c9b1adc3030d706" dependencies = [ "rustversion", ] [[package]] name = "itoa" version = "1.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682" [[package]] name = "js-sys" version = "0.3.92" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "cc4c90f45aa2e6eacbe8645f77fdea542ac97a494bcd117a67df9ff4d611f995" dependencies = [ "once_cell", "wasm-bindgen", ] [[package]] name = "lazy-regex" version = "3.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6bae91019476d3ec7147de9aa291cadb6d870abf2f3015d2da73a90325ac1496" dependencies = [ "lazy-regex-proc_macros", "once_cell", "regex", ] [[package]] name = "lazy-regex-proc_macros" version = "3.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4de9c1e1439d8b7b3061b2d209809f447ca33241733d9a3c01eabf2dc8d94358" dependencies = [ "proc-macro2", "quote", "regex", "syn 2.0.117", ] [[package]] name = "lazy_static" version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe" 
[[package]] name = "leb128fmt" version = "0.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2" [[package]] name = "libc" version = "0.2.183" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b5b646652bf6661599e1da8901b3b9522896f01e736bad5f723fe7a3a27f899d" [[package]] name = "linux-raw-sys" version = "0.12.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53" [[package]] name = "log" version = "0.4.29" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" [[package]] name = "lru" version = "0.13.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "227748d55f2f0ab4735d87fd623798cb6b664512fe979705f829c9f81c934465" dependencies = [ "hashbrown 0.15.5", ] [[package]] name = "lzma-sys" version = "0.1.20" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5fda04ab3764e6cde78b9974eec4f779acaba7c4e84b36eca3cf77c581b85d27" dependencies = [ "cc", "libc", "pkg-config", ] [[package]] name = "maplit" version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3e2e65a1a2e43cfcb47a895c4c8b10d1f4a61097f9f254f183aee60cad9c651d" [[package]] name = "memchr" version = "2.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" [[package]] name = "memoffset" version = "0.9.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "488016bfae457b036d996092f6cb448677611ce4449e970ceaf42695203f218a" dependencies = [ "autocfg", ] [[package]] name = "miniz_oxide" version = "0.8.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316" dependencies = [ "adler2", "simd-adler32", ] [[package]] name = "nix" version = "0.31.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5d6d0705320c1e6ba1d912b5e37cf18071b6c2e9b7fa8215a1e8a7651966f5d3" dependencies = [ "bitflags", "cfg-if", "cfg_aliases", "libc", ] [[package]] name = "num-traits" version = "0.2.19" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841" dependencies = [ "autocfg", ] [[package]] name = "object" version = "0.37.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ff76201f031d8863c38aa7f905eca4f53abbfa15f609db4277d44cd8938f33fe" dependencies = [ "memchr", ] [[package]] name = "once_cell" version = "1.21.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50" [[package]] name = "osutils" version = "0.1.0" dependencies = [ "chrono", "lazy_static", "log", "memchr", "pyo3", "rand", "sha1", "unicode-normalization", "winapi", ] [[package]] name = "pkg-config" version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c" [[package]] name = "portable-atomic" version = "1.13.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c33a9471896f1c69cecef8d20cbe2f7accd12527ce60845ff44c153bb2a21b49" [[package]] name = "ppv-lite86" version = "0.2.21" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" dependencies = [ "zerocopy", ] [[package]] name = "prettyplease" version = "0.2.37" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "479ca8adacdd7ce8f1fb39ce9ecccbfe93a3f1344b3d0d97f20bc0196208f62b" dependencies = [ 
"proc-macro2", "syn 2.0.117", ] [[package]] name = "proc-macro2" version = "1.0.106" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934" dependencies = [ "unicode-ident", ] [[package]] name = "pyo3" version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ab53c047fcd1a1d2a8820fe84f05d6be69e9526be40cb03b73f86b6b03e6d87d" dependencies = [ "chrono", "indoc", "libc", "memoffset", "once_cell", "portable-atomic", "pyo3-build-config", "pyo3-ffi", "pyo3-macros", "unindent", ] [[package]] name = "pyo3-build-config" version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b455933107de8642b4487ed26d912c2d899dec6114884214a0b3bb3be9261ea6" dependencies = [ "target-lexicon", ] [[package]] name = "pyo3-ffi" version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1c85c9cbfaddf651b1221594209aed57e9e5cff63c4d11d1feead529b872a089" dependencies = [ "libc", "pyo3-build-config", ] [[package]] name = "pyo3-filelike" version = "0.5.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5a8cb6cd0231ea816b4452c0cd37b5215f9ec45b66ed3e748fad8eb39cfd4997" dependencies = [ "pyo3", ] [[package]] name = "pyo3-macros" version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0a5b10c9bf9888125d917fb4d2ca2d25c8df94c7ab5a52e13313a07e050a3b02" dependencies = [ "proc-macro2", "pyo3-macros-backend", "quote", "syn 2.0.117", ] [[package]] name = "pyo3-macros-backend" version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "03b51720d314836e53327f5871d4c0cfb4fb37cc2c4a11cc71907a86342c40f9" dependencies = [ "heck", "proc-macro2", "pyo3-build-config", "quote", "syn 2.0.117", ] [[package]] name = "quote" version = "1.0.45" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924" dependencies = [ "proc-macro2", ] [[package]] name = "r-efi" version = "5.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" [[package]] name = "r-efi" version = "6.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f8dcc9c7d52a811697d2151c701e0d08956f92b0e24136cf4cf27b57a6a0d9bf" [[package]] name = "rand" version = "0.9.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6db2770f06117d490610c7488547d543617b21bfa07796d7a12f6f1bd53850d1" dependencies = [ "rand_chacha", "rand_core", ] [[package]] name = "rand_chacha" version = "0.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb" dependencies = [ "ppv-lite86", "rand_core", ] [[package]] name = "rand_core" version = "0.9.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "76afc826de14238e6e8c374ddcc1fa19e374fd8dd986b0d2af0d02377261d83c" dependencies = [ "getrandom 0.3.4", ] [[package]] name = "regex" version = "1.12.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e10754a14b9137dd7b1e3e5b0493cc9171fdd105e0ab477f51b72e7f3ac0e276" dependencies = [ "aho-corasick", "memchr", "regex-automata", "regex-syntax", ] [[package]] name = "regex-automata" version = "0.4.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6e1dd4122fc1595e8162618945476892eefca7b88c52820e74af6262213cae8f" dependencies = [ "aho-corasick", "memchr", "regex-syntax", ] [[package]] name = "regex-syntax" version = "0.8.10" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a" [[package]] name = "rustc-demangle" version = "0.1.27" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "b50b8869d9fc858ce7266cce0194bd74df58b9d0e3f6df3a9fc8eb470d95c09d" [[package]] name = "rustix" version = "1.1.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190" dependencies = [ "bitflags", "errno", "libc", "linux-raw-sys", "windows-sys", ] [[package]] name = "rustversion" version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" [[package]] name = "same-file" version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" dependencies = [ "winapi-util", ] [[package]] name = "semver" version = "1.0.27" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d767eb0aabc880b29956c35734170f26ed551a859dbd361d140cdbeca61ab1e2" [[package]] name = "serde" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e" dependencies = [ "serde_core", ] [[package]] name = "serde_core" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad" dependencies = [ "serde_derive", ] [[package]] name = "serde_derive" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79" dependencies = [ "proc-macro2", "quote", "syn 2.0.117", ] [[package]] name = "serde_json" version = "1.0.149" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "83fc039473c5595ace860d8c4fafa220ff474b3fc6bfdb4293327f1a37e94d86" dependencies = [ "itoa", "memchr", "serde", "serde_core", "zmij", ] [[package]] 
name = "sha1" version = "0.10.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e3bf829a2d51ab4a5ddf1352d8470c140cadc8301b2ae1789db023f01cedd6ba" dependencies = [ "cfg-if", "cpufeatures", "digest", ] [[package]] name = "shlex" version = "1.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" [[package]] name = "simd-adler32" version = "0.3.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "703d5c7ef118737c72f1af64ad2f6f8c5e1921f818cdcb97b8fe6fc69bf66214" [[package]] name = "syn" version = "1.0.109" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "72b64191b275b66ffe2469e8af2c1cfe3bafa67b529ead792a6d0160888b4237" dependencies = [ "proc-macro2", "quote", "unicode-ident", ] [[package]] name = "syn" version = "2.0.117" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99" dependencies = [ "proc-macro2", "quote", "unicode-ident", ] [[package]] name = "synstructure" version = "0.12.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f36bdaa60a83aca3921b5259d5400cbf5e90fc51931376a9bd4a0eb79aa7210f" dependencies = [ "proc-macro2", "quote", "syn 1.0.109", "unicode-xid", ] [[package]] name = "target-lexicon" version = "0.13.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "adb6935a6f5c20170eeceb1a3835a49e12e19d792f6dd344ccc76a985ca5a6ca" [[package]] name = "tempfile" version = "3.27.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd" dependencies = [ "fastrand", "getrandom 0.4.2", "once_cell", "rustix", "windows-sys", ] [[package]] name = "tinyvec" version = "1.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"3e61e67053d25a4e82c844e8424039d9745781b3fc4f32b8d55ed50f5f667ef3" dependencies = [ "tinyvec_macros", ] [[package]] name = "tinyvec_macros" version = "0.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20" [[package]] name = "typenum" version = "1.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" [[package]] name = "unicode-ident" version = "1.0.24" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" [[package]] name = "unicode-normalization" version = "0.1.25" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5fd4f6878c9cb28d874b009da9e8d183b5abc80117c40bbd187a1fde336be6e8" dependencies = [ "tinyvec", ] [[package]] name = "unicode-xid" version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853" [[package]] name = "unindent" version = "0.2.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7264e107f553ccae879d21fbea1d6724ac785e8c3bfc762137959b5802826ef3" [[package]] name = "version_check" version = "0.9.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" [[package]] name = "walkdir" version = "2.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "29790946404f91d9c5d06f9874efddea1dc06c5efe94541a7d6863108e3a5e4b" dependencies = [ "same-file", "winapi-util", ] [[package]] name = "wasip2" version = "1.0.2+wasi-0.2.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9517f9239f02c069db75e65f174b3da828fe5f5b945c4dd26bd25d89c03ebcf5" dependencies = [ "wit-bindgen", ] [[package]] name = 
"wasip3" version = "0.4.0+wasi-0.3.0-rc-2026-01-06" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5428f8bf88ea5ddc08faddef2ac4a67e390b88186c703ce6dbd955e1c145aca5" dependencies = [ "wit-bindgen", ] [[package]] name = "wasm-bindgen" version = "0.2.115" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6523d69017b7633e396a89c5efab138161ed5aafcbc8d3e5c5a42ae38f50495a" dependencies = [ "cfg-if", "once_cell", "rustversion", "wasm-bindgen-macro", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-macro" version = "0.2.115" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4e3a6c758eb2f701ed3d052ff5737f5bfe6614326ea7f3bbac7156192dc32e67" dependencies = [ "quote", "wasm-bindgen-macro-support", ] [[package]] name = "wasm-bindgen-macro-support" version = "0.2.115" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "921de2737904886b52bcbb237301552d05969a6f9c40d261eb0533c8b055fedf" dependencies = [ "bumpalo", "proc-macro2", "quote", "syn 2.0.117", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-shared" version = "0.2.115" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a93e946af942b58934c604527337bad9ae33ba1d5c6900bbb41c2c07c2364a93" dependencies = [ "unicode-ident", ] [[package]] name = "wasm-encoder" version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "990065f2fe63003fe337b932cfb5e3b80e0b4d0f5ff650e6985b1048f62c8319" dependencies = [ "leb128fmt", "wasmparser", ] [[package]] name = "wasm-metadata" version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bb0e353e6a2fbdc176932bbaab493762eb1255a7900fe0fea1a2f96c296cc909" dependencies = [ "anyhow", "indexmap", "wasm-encoder", "wasmparser", ] [[package]] name = "wasmparser" version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"47b807c72e1bac69382b3a6fb3dbe8ea4c0ed87ff5629b8685ae6b9a611028fe" dependencies = [ "bitflags", "hashbrown 0.15.5", "indexmap", "semver", ] [[package]] name = "winapi" version = "0.3.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" dependencies = [ "winapi-i686-pc-windows-gnu", "winapi-x86_64-pc-windows-gnu", ] [[package]] name = "winapi-i686-pc-windows-gnu" version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" [[package]] name = "winapi-util" version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ "windows-sys", ] [[package]] name = "winapi-x86_64-pc-windows-gnu" version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" [[package]] name = "windows-core" version = "0.62.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b8e83a14d34d0623b51dce9581199302a221863196a1dde71a7663a4c2be9deb" dependencies = [ "windows-implement", "windows-interface", "windows-link", "windows-result", "windows-strings", ] [[package]] name = "windows-implement" version = "0.60.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "053e2e040ab57b9dc951b72c264860db7eb3b0200ba345b4e4c3b14f67855ddf" dependencies = [ "proc-macro2", "quote", "syn 2.0.117", ] [[package]] name = "windows-interface" version = "0.59.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3f316c4a2570ba26bbec722032c4099d8c8bc095efccdc15688708623367e358" dependencies = [ "proc-macro2", "quote", "syn 2.0.117", ] [[package]] name = "windows-link" version = "0.2.1" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5" [[package]] name = "windows-result" version = "0.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7781fa89eaf60850ac3d2da7af8e5242a5ea78d1a11c49bf2910bb5a73853eb5" dependencies = [ "windows-link", ] [[package]] name = "windows-strings" version = "0.5.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7837d08f69c77cf6b07689544538e017c1bfcf57e34b4c0ff58e6c2cd3b37091" dependencies = [ "windows-link", ] [[package]] name = "windows-sys" version = "0.61.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc" dependencies = [ "windows-link", ] [[package]] name = "wit-bindgen" version = "0.51.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d7249219f66ced02969388cf2bb044a09756a083d0fab1e566056b04d9fbcaa5" dependencies = [ "wit-bindgen-rust-macro", ] [[package]] name = "wit-bindgen-core" version = "0.51.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ea61de684c3ea68cb082b7a88508a8b27fcc8b797d738bfc99a82facf1d752dc" dependencies = [ "anyhow", "heck", "wit-parser", ] [[package]] name = "wit-bindgen-rust" version = "0.51.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b7c566e0f4b284dd6561c786d9cb0142da491f46a9fbed79ea69cdad5db17f21" dependencies = [ "anyhow", "heck", "indexmap", "prettyplease", "syn 2.0.117", "wasm-metadata", "wit-bindgen-core", "wit-component", ] [[package]] name = "wit-bindgen-rust-macro" version = "0.51.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0c0f9bfd77e6a48eccf51359e3ae77140a7f50b1e2ebfe62422d8afdaffab17a" dependencies = [ "anyhow", "prettyplease", "proc-macro2", "quote", "syn 2.0.117", "wit-bindgen-core", "wit-bindgen-rust", ] [[package]] 
name = "wit-component" version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9d66ea20e9553b30172b5e831994e35fbde2d165325bec84fc43dbf6f4eb9cb2" dependencies = [ "anyhow", "bitflags", "indexmap", "log", "serde", "serde_derive", "serde_json", "wasm-encoder", "wasm-metadata", "wasmparser", "wit-parser", ] [[package]] name = "wit-parser" version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ecc8ac4bc1dc3381b7f59c34f00b67e18f910c2c0f50015669dde7def656a736" dependencies = [ "anyhow", "id-arena", "indexmap", "log", "semver", "serde", "serde_derive", "serde_json", "unicode-xid", "wasmparser", ] [[package]] name = "xml-rs" version = "0.8.28" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3ae8337f8a065cfc972643663ea4279e04e7256de865aa66fe25cec5fb912d3f" [[package]] name = "xmltree" version = "0.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b619f8c85654798007fb10afa5125590b43b088c225a25fc2fec100a9fad0fc6" dependencies = [ "xml-rs", ] [[package]] name = "xz2" version = "0.1.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "388c44dc09d76f1536602ead6d325eb532f5c122f17782bd57fb47baeeb767e2" dependencies = [ "lzma-sys", ] [[package]] name = "zerocopy" version = "0.8.48" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9" dependencies = [ "zerocopy-derive", ] [[package]] name = "zerocopy-derive" version = "0.8.48" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4" dependencies = [ "proc-macro2", "quote", "syn 2.0.117", ] [[package]] name = "zmij" version = "1.0.21" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa" 
bzrformats_3.4.0.orig/Cargo.toml0000644000000000000000000000040115162115076013605 0ustar00[workspace] members = ["crates/*"] [workspace.package] version = "3.4.0" [workspace.dependencies] nix = ">=0.26" pyo3 = ">=0.26,<0.28" pyo3-filelike = "0.5.0" chrono = { version = "0.4", default-features = false, features = ["std", "clock"] } log = "0.4" bzrformats_3.4.0.orig/MANIFEST.in0000644000000000000000000000022315162074037013416 0ustar00include README.rst setup.py COPYING.txt recursive-include crates Cargo.toml *.rs include Cargo.lock include Cargo.toml include bzrformats/py.typed bzrformats_3.4.0.orig/README.md0000644000000000000000000000350315162075770013150 0ustar00# bzrformats Core Bazaar format implementations and utilities, extracted from the [Breezy](https://www.breezy-vcs.org/) version control system. ## Overview bzrformats provides the internal format implementations that power Bazaar-compatible version control. It includes serialization, compression, indexing, and data structure modules for reading and writing Bazaar repositories, working trees, and branches. 
## Features - **Versioned file storage** — knit, weave, and groupcompress formats - **Directory state tracking** — efficient metadata caching for working trees - **Serialization** — XML-based inventory and revision serialization (formats 5–8), plus CHK-based serialization - **Indexing** — graph index and B+Tree index for pack-based repositories - **Compression** — groupcompress for efficient delta storage of related files - **Pack repositories** — container format for bundling versioned data - **Rust accelerators** — performance-critical code implemented in Rust with Python bindings via PyO3 - **Cython extensions** — optional compiled extensions for hot paths ## Installation ``` pip install bzrformats ``` ### Build requirements Building from source requires: - Python >= 3.10, < 3.15 - A Rust toolchain (for the compiled extensions) - Cython >= 0.29 ## Usage This package is primarily intended for use by version control systems and tools that need to work with Bazaar format data. The modules provide building blocks for implementing Bazaar-compatible storage formats. ```python from bzrformats import knit, groupcompress, index ``` ## License GNU General Public License v2 or later (GPLv2+). See [COPYING.txt](COPYING.txt). ## History These modules were originally part of the [Breezy](https://github.com/breezy-team/breezy) project (`breezy.bzr`) and have been extracted into a standalone package. 
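As a rough way to see which of the compiled modules mentioned under Features are present in a given installation, a sketch along the following lines may help. The module names are the ones listed in `setup.py`; the helper `compiled_extension_status` is hypothetical and not part of bzrformats, and it only checks whether each module can be located, not whether it loads correctly.

```python
import importlib.util

# Module names as listed in setup.py; whether each one exists depends on
# how the package was built (assumption: names are unchanged in your install).
COMPILED_MODULES = [
    "bzrformats._bzr_rs",
    "bzrformats._osutils_rs",
    "bzrformats._groupcompress_pyx",
    "bzrformats._knit_load_data_pyx",
    "bzrformats._dirstate_helpers_pyx",
    "bzrformats._btree_serializer_pyx",
]


def compiled_extension_status(names=COMPILED_MODULES):
    """Map each module name to True if it can be located, False otherwise.

    Hypothetical helper for illustration; not provided by bzrformats.
    """
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when the parent package itself cannot be imported.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in compiled_extension_status().items():
        print(f"{name}: {'available' if found else 'missing'}")
```

According to `setup.py`, a missing Cython extension means the pure Python version is used instead, so a `missing` entry for a `_pyx` module is not necessarily an error.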
bzrformats_3.4.0.orig/bzrformats/0000755000000000000000000000000015162073400014045 5ustar00bzrformats_3.4.0.orig/crates/0000755000000000000000000000000014405061146013140 5ustar00bzrformats_3.4.0.orig/doc/0000755000000000000000000000000015162203117012421 5ustar00bzrformats_3.4.0.orig/pyproject.toml0000644000000000000000000001052415162235736014606 0ustar00[build-system] requires = [ "setuptools>=60", "setuptools-rust", "cython>=0.29", ] build-backend = "setuptools.build_meta" [project] name = "bzrformats" maintainers = [{name = "Breezy Developers", email = "team@breezy-vcs.org"}] description = "Bazaar formats" readme = "README.md" license = "GPL-2.0-or-later" classifiers = [ "Development Status :: 6 - Mature", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: System Administrators", "Operating System :: OS Independent", "Operating System :: POSIX", "Programming Language :: Python", "Programming Language :: Rust", "Programming Language :: C", "Topic :: Software Development :: Version Control", ] requires-python = ">=3.10,<3.15" dependencies = [ "catalogus", "patiencediff", "vcsgraph", ] version = "3.4.0" [project.urls] Homepage = "https://www.breezy-vcs.org/" Download = "https://launchpad.net/brz/+download" Repository = "https://github.com/breezy-team/bzrformats" [project.optional-dependencies] dev = [ "testtools", "testscenarios", "python-subunit", ] [tool.setuptools] zip-safe = false include-package-data = false [tool.setuptools.packages.find] include = ["bzrformats"] namespaces = false [tool.setuptools.package-data] bzrformats = [ "py.typed", ] [tool.mypy] ignore_missing_imports = true [tool.ruff] extend-exclude = [] [tool.ruff.lint] select = [ "ANN", # annotations "D", # pydocstyle "E", # pycodestyle "F", # pyflakes "N", # naming "B", # bugbear "I", # isort "S", # bandit "TCH", # typecheck "INT", # gettext "SIM", # simplify "C4", # comprehensions "UP", # pyupgrade "RUF", # ruff-specific ] ignore = [ "ANN001", "ANN002", "ANN003", # 
missing-type-arg "ANN201", "ANN202", "ANN204", "ANN205", "ANN206", "D205", # 1 blank line required between summary line and description "D417", # Missing argument descriptions in the docstring "F821", # undefined-name "E501", # line too long "D402", # First line should not be the function's signature "E402", # module level import not at top of file "E741", # ambiguous variable name "F405", # name may be undefined, or defined from star imports "N801", # Naming convention violation: class name should use CapWords "N802", # Naming convention violation: function name should be lowercase "N804", # Naming convention violation: first argument of a class method should be named `cls` "N806", # Naming convention violation: variable in function should be lowercase "N818", # Naming convention violation: exception name should end with `Error` "N999", # Naming convention violation: invalid module name "S110", # "consider logging exception" "S317", # use defusedxml # This triggers for docstrings that use __doc__ "D104", # Missing docstring in public package "RUF012", # Mutable class attributes should be annotated with `typing.ClassVar` "RUF005", # Consider iterable unpacking instead of concatenation "RUF015", # Prefer `next(...)` over single element slice "SIM102", # Use a single `if` statement instead of nested `if` statements "SIM105", # Use `contextlib.suppress` "SIM108", # Use ternary operator "SIM114", # Combine `if` branches using logical `or` operator "SIM115", # Use a context manager for opening files # Some objects (e.g. VersionedFiles) have a keys() method but no __iter__ "SIM118", # Use `key in dict` instead of `key in dict.keys()` "UP031", # Use format specifiers instead of percent format "UP032", # Use f-string instead of `format` call; f-strings break gettext ] # These are actually fine, but they make mypy more strict and then it fails.
unfixable = ["ANN204"] [tool.ruff.lint.pydocstyle] convention = "google" [tool.cibuildwheel.linux] skip = "*-musllinux_*" archs = ["auto", "aarch64"] [tool.cibuildwheel.macos] [tool.cibuildwheel.windows] [tool.ruff.lint.extend-per-file-ignores] # Ignore docstring requirements for test files "bzrformats/tests/**/*.py" = ["D100", "D101", "D102", "D103", "D104", "D105", "D106", "D107"] "bzrformats/*/tests/**/*.py" = ["D100", "D101", "D102", "D103", "D104", "D105", "D106", "D107"] "bzrformats/**/test_*.py" = ["D100", "D101", "D102", "D103", "D104", "D105", "D106", "D107"] "bzrformats/**/*_test.py" = ["D100", "D101", "D102", "D103", "D104", "D105", "D106", "D107"] bzrformats_3.4.0.orig/setup.py0000755000000000000000000001645415162115076013411 0ustar00#! /usr/bin/env python3 """Installation script for bzrformats. Run it with './setup.py install', or './setup.py --help' for more options. """ import os import os.path import sys try: import setuptools # noqa: F401 except ModuleNotFoundError as e: sys.stderr.write(f"[ERROR] Please install setuptools ({e})\n") sys.exit(1) try: from setuptools_rust import Binding, RustExtension except ModuleNotFoundError as e: sys.stderr.write(f"[ERROR] Please install setuptools_rust ({e})\n") sys.exit(1) from setuptools import setup try: from packaging.version import Version except ImportError: from distutils.version import LooseVersion as Version from distutils.command.build_scripts import build_scripts from setuptools import Command ############################### # Overridden distutils actions ############################### class brz_build_scripts(build_scripts): """Custom build_scripts command that handles Rust extension binaries. This class extends the standard build_scripts command to properly handle Rust extension binaries by moving executable Rust extensions from the build_lib directory to the scripts directory. """ def run(self): """Execute the build_scripts command and handle Rust executables. 
First runs the standard build_scripts process, then moves any Rust executable extensions from the build_lib directory to the scripts build directory. """ build_scripts.run(self) self.run_command("build_ext") build_ext = self.get_finalized_command("build_ext") for ext in self.distribution.rust_extensions: if ext.binding == Binding.Exec: # GZ 2021-08-19: Not handling multiple binaries yet. os.replace( os.path.join(build_ext.build_lib, ext.name), os.path.join(self.build_dir, ext.name), ) class build_man(Command): """Custom command to generate the brz.1 manual page. This command builds the Breezy extension modules and then uses the generate_docs tool to create the brz.1 manual page from the built modules. """ def initialize_options(self): """Initialize command options. No options to initialize for this command. """ pass def finalize_options(self): """Finalize command options. No options to finalize for this command. """ pass def run(self): """Execute the manual page generation. Builds the extension modules, adds the build directory to sys.path, and then imports and runs the generate_docs tool to create the brz.1 manual page. """ build_ext_cmd = self.get_finalized_command("build_ext") build_lib_dir = build_ext_cmd.build_lib sys.path.insert(0, os.path.abspath(build_lib_dir)) import importlib importlib.invalidate_caches() del sys.modules["breezy"] from tools import generate_docs generate_docs.main(["generate-docs", "man"]) ######################## ## Setup ######################## command_classes = { "build_man": build_man, } from distutils.extension import Extension ext_modules = [] try: from Cython.Compiler.Version import version as cython_version from Cython.Distutils import build_ext except ModuleNotFoundError: have_cython = False # try to build the extension from the prior generated source. print("") print( "The python package 'Cython' is not available. 
If the .c files are available," ) print("they will be built, but modifying the .pyx files will not rebuild them.") print("") from distutils.command.build_ext import build_ext else: minimum_cython_version = "0.29" cython_version_info = Version(cython_version) if cython_version_info < Version(minimum_cython_version): print( "Version of Cython is too old. " f"Current is {cython_version}, need at least {minimum_cython_version}." ) print( "If the .c files are available, they will be built," " but modifying the .pyx files will not rebuild them." ) have_cython = False else: have_cython = True # Override the build_ext if we have Cython available command_classes["build_ext"] = build_ext unavailable_files = [] def add_cython_extension(module_name, libraries=None, extra_source=None): """Add a Cython extension module to the build configuration. This function configures a Cython extension for building. If Cython is available, it will compile from .pyx files. Otherwise, it falls back to pre-generated .c files. If neither is available, the extension is skipped with a warning. Args: module_name (str): The python path to the module (e.g., 'bzrformats.foo'). This determines the .pyx and .c file paths to use. libraries (list, optional): List of libraries to link against. Defaults to None. extra_source (list, optional): Additional source files to include. Defaults to None. Note: On Windows, the WIN32 macro is automatically defined for Cython compatibility. The function adds appropriate include directories and handles the optional nature of extensions for CI builds. """ if extra_source is None: extra_source = [] path = module_name.replace(".", "/") cython_name = path + ".pyx" c_name = path + ".c" define_macros = [] if sys.platform == "win32": # cython uses the macro WIN32 to detect the platform, even though it # should be using something like _WIN32 or MS_WINDOWS, oh well, we can # give it the right value. 
define_macros.append(("WIN32", None)) if have_cython: source = [cython_name] else: if not os.path.isfile(c_name): unavailable_files.append(c_name) return else: source = [c_name] source.extend(extra_source) include_dirs = ["breezy"] ext_modules.append( Extension( module_name, source, define_macros=define_macros, libraries=libraries, include_dirs=include_dirs, optional=os.environ.get("CIBUILDWHEEL", "0") != "1", ) ) add_cython_extension( "bzrformats._groupcompress_pyx", extra_source=["bzrformats/diff-delta.c"] ) add_cython_extension("bzrformats._knit_load_data_pyx") if sys.platform == "win32": add_cython_extension("bzrformats._dirstate_helpers_pyx", libraries=["Ws2_32"]) else: add_cython_extension("bzrformats._dirstate_helpers_pyx") add_cython_extension("bzrformats._btree_serializer_pyx") if unavailable_files: print("C extension(s) not found:") print(" {}".format("\n ".join(unavailable_files))) print("The python versions will be used instead.") print("") import site site.ENABLE_USER_SITE = "--user" in sys.argv rust_extensions = [ RustExtension( "bzrformats._bzr_rs", "crates/bazaar-py/Cargo.toml", binding=Binding.PyO3 ), RustExtension( "bzrformats._osutils_rs", "crates/osutils-py/Cargo.toml", binding=Binding.PyO3 ), ] entry_points = {} # std setup setup( cmdclass=command_classes, ext_modules=ext_modules, entry_points=entry_points, rust_extensions=rust_extensions, ) bzrformats_3.4.0.orig/bzrformats/.gitignore0000644000000000000000000000016215162073400016034 0ustar00__pycache__ *.pyc build/ *~ *.swp *.swo *.swn .mypy_cache/ .pytest_cache/ dist/ *.egg-info/ bzrformats/_version.pybzrformats_3.4.0.orig/bzrformats/README.md0000644000000000000000000000313415162073400015325 0ustar00# bzrformats Core Bazaar format implementations and utilities extracted from the Breezy project. ## Overview This package contains the internal format implementations and utilities that were part of `breezy.bzr`. 
These modules provide the core serialization, compression, and data structure functionality for Bazaar version control formats. ## Modules Included ### Serialization Infrastructure - `xml_serializer.py` - Base XML serialization utilities - `xml5.py`, `xml6.py`, `xml7.py`, `xml8.py` - Version-specific XML serialization formats - `chk_serializer.py` - CHK-based inventory serialization ### Utilities - `tuned_gzip.py` - Optimized gzip compression for version control data - `recordcounter.py` - Progress estimation utilities - `_btree_serializer_py.py` - Low-level B+Tree serialization ## Purpose These modules were extracted from Breezy to: 1. Provide reusable format implementations for other projects 2. Create cleaner separation of concerns 3. Enable independent testing and maintenance 4. Offer reference implementations of Bazaar data formats ## Usage This package is primarily intended for use by version control systems and tools that need to work with Bazaar format data. The modules provide building blocks for implementing Bazaar-compatible storage formats. ## License This project is licensed under the GNU General Public License v2 or later (GPLv2+), consistent with the original Bazaar project. ## History These modules were originally part of the Breezy project (https://github.com/breezy-team/breezy) and represent internal implementation details of the Bazaar version control format. bzrformats_3.4.0.orig/bzrformats/__init__.py0000644000000000000000000000557615162235736016207 0ustar00"""Core Bazaar format implementations and utilities. This package contains the internal format implementations and utilities that were extracted from breezy.bzr. These modules provide core serialization, compression, and data structure functionality for Bazaar version control formats. """ # Same format as sys.version_info: "A tuple containing the five components of # the version number: major, minor, micro, releaselevel, and serial. 
All # values except releaselevel are integers; the release level is 'alpha', # 'beta', 'candidate', or 'final'. The version_info value corresponding to the # Python version 2.0 is (2, 0, 0, 'final', 0)." Additionally we use a # releaselevel of 'dev' for unreleased under-development code. version_info = (3, 4, 0, "final", 0) def _format_version_tuple(version_info): """Turn a version number 2, 3 or 5-tuple into a short string. This format matches the typical presentation used in Python output. A final release with a non-zero sub-release appends it as an extra component, and an unknown release level falls back to joining all fields with dots. >>> print(_format_version_tuple((1, 0, 0, 'final', 0))) 1.0.0 >>> print(_format_version_tuple((1, 2, 0, 'dev', 0))) 1.2.0.dev >>> print(_format_version_tuple((1, 2, 0, 'dev', 1))) 1.2.0.dev1 >>> print(_format_version_tuple((1, 1, 1, 'candidate', 2))) 1.1.1.rc2 >>> print(_format_version_tuple((2, 1, 0, 'beta', 1))) 2.1.b1 >>> print(_format_version_tuple((1, 4, 0))) 1.4.0 >>> print(_format_version_tuple((1, 4))) 1.4 >>> print(_format_version_tuple((2, 1, 0, 'final', 42))) 2.1.0.42 >>> print(_format_version_tuple((1, 4, 0, 'wibble', 0))) 1.4.0.wibble.0 """ if len(version_info) == 2: main_version = "%d.%d" % version_info[:2] else: main_version = "%d.%d.%d" % version_info[:3] if len(version_info) <= 3: return main_version release_type = version_info[3] sub = version_info[4] if release_type == "final" and sub == 0: sub_string = "" elif release_type == "final": sub_string = "." + str(sub) elif release_type == "dev" and sub == 0: sub_string = ".dev" elif release_type == "dev": sub_string = ".dev" + str(sub) elif release_type in ("alpha", "beta"): if version_info[2] == 0: main_version = "%d.%d" % version_info[:2] sub_string = "."
+ release_type[0] + str(sub) elif release_type == "candidate": sub_string = ".rc" + str(sub) else: return ".".join(map(str, version_info)) return main_version + sub_string __version__ = _format_version_tuple(version_info) version_string = __version__ _core_version_string = ".".join(map(str, version_info[:3])) __all__ = [ "__version__", "version_info", "version_string", ] from . import _bzr_rs rio = _bzr_rs.rio hashcache = _bzr_rs.hashcache bzrformats_3.4.0.orig/bzrformats/_btree_serializer_py.py0000644000000000000000000000512715162073400020625 0ustar00# Copyright (C) 2008, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """B+Tree index parsing.""" def _parse_leaf_lines(data, key_length, ref_list_length): lines = data.split(b"\n") nodes = [] for line in lines[1:]: if line == b"": return nodes elements = line.split(b"\0", key_length) # keys are tuples key = tuple(elements[:key_length]) line = elements[-1] references, value = line.rsplit(b"\0", 1) if ref_list_length: ref_lists = [] for ref_string in references.split(b"\t"): ref_list = tuple( [tuple(ref.split(b"\0")) for ref in ref_string.split(b"\r") if ref] ) ref_lists.append(ref_list) ref_lists = tuple(ref_lists) node_value = (value, ref_lists) else: node_value = (value, ()) nodes.append((key, node_value)) return nodes def _flatten_node(node, reference_lists): """Convert a node into the serialized form. :param node: A tuple representing a node: (index, key_tuple, value, references) :param reference_lists: Does this index have reference lists? :return: (string_key, flattened) string_key The serialized key for referencing this node flattened A string with the serialized form for the contents """ if reference_lists: # TODO: Consider turning this back into the 'unoptimized' nested loop # form. It is probably more obvious for most people, and this is # just a reference implementation.
flattened_references = [ b"\r".join([b"\x00".join(reference) for reference in ref_list]) for ref_list in node[3] ] else: flattened_references = [] string_key = b"\x00".join(node[1]) line = b"%s\x00%s\x00%s\n" % (string_key, b"\t".join(flattened_references), node[2]) return string_key, line bzrformats_3.4.0.orig/bzrformats/_btree_serializer_pyx.pyx0000644000000000000000000010351215162073400021202 0ustar00# Copyright (C) 2008, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # cython: language_level=3 """Pyrex extensions to btree node parsing.""" cdef extern from "python-compat.h": pass from cpython.bytes cimport (PyBytes_AS_STRING, PyBytes_AsString, PyBytes_CheckExact, PyBytes_FromFormat, PyBytes_FromStringAndSize, PyBytes_GET_SIZE, PyBytes_Size) from cpython.list cimport PyList_Append from cpython.mem cimport PyMem_Free, PyMem_Malloc from cpython.object cimport PyObject from cpython.ref cimport Py_INCREF from cpython.tuple cimport (PyTuple_CheckExact, PyTuple_GET_ITEM, PyTuple_GET_SIZE, PyTuple_New, PyTuple_SET_ITEM) from libc.stdlib cimport strtoul, strtoull from libc.string cimport memchr, memcmp, memcpy, strncmp from ._str_helpers cimport (_my_memrchr, safe_interned_string_from_size, safe_string_from_size) import sys cdef class BTreeLeafParser: """Parse the leaf nodes of a BTree index. 
:ivar data: The PyBytes object containing the uncompressed text for the node. :ivar key_length: An integer describing how many pieces the keys have for this index. :ivar ref_list_length: An integer describing how many references this index contains. :ivar keys: A PyList of keys found in this node. :ivar _cur_str: A pointer to the start of the next line to parse :ivar _end_str: A pointer to the end of bytes :ivar _start: Pointer to the location within the current line while parsing. :ivar _header_found: True when we have parsed the header for this node """ cdef object data cdef int key_length cdef int ref_list_length cdef object keys cdef char * _cur_str cdef char * _end_str # The current start point for parsing cdef char * _start cdef int _header_found def __init__(self, data, key_length, ref_list_length): self.data = data self.key_length = key_length self.ref_list_length = ref_list_length self.keys = [] self._cur_str = NULL self._end_str = NULL self._header_found = 0 # keys are tuples cdef extract_key(self, char * last): """Extract a key. :param last: points at the byte after the last byte permitted for the key. 
""" cdef char *temp_ptr cdef int loop_counter cdef tuple key key = PyTuple_New(self.key_length) for loop_counter from 0 <= loop_counter < self.key_length: # grab a key segment temp_ptr = memchr(self._start, c'\0', last - self._start) if temp_ptr == NULL: if loop_counter + 1 == self.key_length: # capture to last temp_ptr = last else: # Invalid line failure_string = ("invalid key, wanted segment from " + repr(safe_string_from_size(self._start, last - self._start))) raise AssertionError(failure_string) # capture the key string if (self.key_length == 1 and (temp_ptr - self._start) == 45 and strncmp(self._start, b'sha1:', 5) == 0): key_element = safe_string_from_size(self._start, temp_ptr - self._start) else: key_element = safe_interned_string_from_size(self._start, temp_ptr - self._start) # advance our pointer self._start = temp_ptr + 1 Py_INCREF(key_element) PyTuple_SET_ITEM(key, loop_counter, key_element) return key cdef int process_line(self) except -1: """Process a line in the bytes.""" cdef char *last cdef char *temp_ptr cdef char *ref_ptr cdef char *next_start cdef int loop_counter cdef Py_ssize_t str_len self._start = self._cur_str # Find the next newline last = memchr(self._start, c'\n', self._end_str - self._start) if last == NULL: # Process until the end of the file last = self._end_str self._cur_str = self._end_str else: # And the next string is right after it self._cur_str = last + 1 # The last character is right before the '\n' if last == self._start: # parsed it all. 
return 0 if last < self._start: # Unexpected error condition - fail raise AssertionError("last < self._start") if 0 == self._header_found: # The first line in a leaf node is the header "type=leaf\n" if strncmp(b"type=leaf", self._start, last - self._start) == 0: self._header_found = 1 return 0 else: raise AssertionError('Node did not start with "type=leaf": %r' % (safe_string_from_size(self._start, last - self._start))) key = self.extract_key(last) # find the value area temp_ptr = _my_memrchr(self._start, c'\0', last - self._start) if temp_ptr == NULL: # Invalid line raise AssertionError("Failed to find the value area") else: # Because of how conversions were done, we ended up with *lots* of # values that are identical. These are all of the 0-length nodes # that are referred to by the TREE_ROOT (and likely some other # directory nodes.) For example, bzr has 25k references to # something like '12607215 328306 0 0', which ends up consuming 1MB # of memory, just for those strings. str_len = last - temp_ptr - 1 if (str_len > 4 and strncmp(b" 0 0", last - 4, 4) == 0): # This drops peak mem for bzr.dev from 87.4MB => 86.2MB # For Launchpad 236MB => 232MB value = safe_interned_string_from_size(temp_ptr + 1, str_len) else: value = safe_string_from_size(temp_ptr + 1, str_len) # shrink the references end point last = temp_ptr if self.ref_list_length: ref_lists = PyTuple_New(self.ref_list_length) loop_counter = 0 while loop_counter < self.ref_list_length: ref_list = [] # extract a reference list loop_counter = loop_counter + 1 if last < self._start: raise AssertionError("last < self._start") # find the next reference list end point: temp_ptr = memchr(self._start, c'\t', last - self._start) if temp_ptr == NULL: # Only valid for the last list if loop_counter != self.ref_list_length: # Invalid line raise AssertionError( "invalid key, loop_counter != self.ref_list_length") else: # scan to the end of the ref list area ref_ptr = last next_start = last else: # scan to the end of 
this ref list ref_ptr = temp_ptr next_start = temp_ptr + 1 # Now, there may be multiple keys in the ref list. while self._start < ref_ptr: # loop finding keys and extracting them temp_ptr = memchr(self._start, c'\r', ref_ptr - self._start) if temp_ptr == NULL: # key runs to the end temp_ptr = ref_ptr PyList_Append(ref_list, self.extract_key(temp_ptr)) ref_list = tuple(ref_list) Py_INCREF(ref_list) PyTuple_SET_ITEM(ref_lists, loop_counter - 1, ref_list) # prepare for the next reference list self._start = next_start node_value = (value, ref_lists) else: if last != self._start: # unexpected reference data present raise AssertionError("unexpected reference data present") node_value = (value, ()) PyList_Append(self.keys, (key, node_value)) return 0 def parse(self): cdef Py_ssize_t byte_count if not PyBytes_CheckExact(self.data): raise AssertionError('self.data is not a byte string.') byte_count = PyBytes_GET_SIZE(self.data) self._cur_str = PyBytes_AS_STRING(self.data) # This points to the last character in the string self._end_str = self._cur_str + byte_count while self._cur_str < self._end_str: self.process_line() return self.keys def _parse_leaf_lines(data, key_length, ref_list_length): parser = BTreeLeafParser(data, key_length, ref_list_length) return parser.parse() # TODO: We can go from 8 byte offset + 4 byte length to a simple lookup, # because the block_offset + length is likely to be repeated. However, # the big win there is to cache across pages, and not just one page # Though if we did cache in a page, we could certainly use a short int. # And this goes from 40 bytes to 30 bytes. # One slightly ugly option would be to cache block offsets in a global. # However, that leads to thread-safety issues, etc. 
ctypedef struct gc_chk_sha1_record: unsigned long long block_offset unsigned int block_length unsigned int record_start unsigned int record_end char sha1[20] cdef int _unhexbuf[256] cdef char *_hexbuf _hexbuf = b'0123456789abcdef' cdef _populate_unhexbuf(): cdef int i for i from 0 <= i < 256: _unhexbuf[i] = -1 for i from 0 <= i < 10: # 0123456789 => map to the raw number _unhexbuf[(i + c'0')] = i for i from 10 <= i < 16: # abcdef => 10, 11, 12, 13, 14, 15 _unhexbuf[(i - 10 + c'a')] = i for i from 10 <= i < 16: # ABCDEF => 10, 11, 12, 13, 14, 15 _unhexbuf[(i - 10 + c'A')] = i _populate_unhexbuf() cdef int _unhexlify_sha1(char *as_hex, char *as_bin): # cannot_raise """Take the hex sha1 in as_hex and make it binary in as_bin Same as binascii.unhexlify, but working on C strings, not Python objects. """ cdef int top cdef int bot cdef int i, j cdef char *cur # binascii does this using isupper() and tolower() and ?: syntax. I'm # guessing a simple lookup array should be faster. j = 0 for i from 0 <= i < 20: top = _unhexbuf[(as_hex[j])] j = j + 1 bot = _unhexbuf[(as_hex[j])] j = j + 1 if top == -1 or bot == -1: return 0 as_bin[i] = ((top << 4) + bot); return 1 def _py_unhexlify(as_hex): """For the test infrastructure, just thunks to _unhexlify_sha1""" if not PyBytes_CheckExact(as_hex) or PyBytes_GET_SIZE(as_hex) != 40: raise ValueError('not a 40-byte hex digest') as_bin = PyBytes_FromStringAndSize(NULL, 20) if _unhexlify_sha1(PyBytes_AS_STRING(as_hex), PyBytes_AS_STRING(as_bin)): return as_bin return None cdef void _hexlify_sha1(char *as_bin, char *as_hex): # cannot_raise cdef int i, j cdef char c j = 0 for i from 0 <= i < 20: c = as_bin[i] as_hex[j] = _hexbuf[(c>>4)&0xf] j = j + 1 as_hex[j] = _hexbuf[(c)&0xf] j = j + 1 def _py_hexlify(as_bin): """For test infrastructure, thunk to _hexlify_sha1""" if len(as_bin) != 20 or not PyBytes_CheckExact(as_bin): raise ValueError('not a 20-byte binary digest') as_hex = PyBytes_FromStringAndSize(NULL, 40)
_hexlify_sha1(PyBytes_AS_STRING(as_bin), PyBytes_AS_STRING(as_hex)) return as_hex cdef int _key_to_sha1(key, char *sha1): # cannot_raise """Map a key into its sha1 content. :param key: A tuple of style ('sha1:abcd...',) :param sha1: A char buffer of 20 bytes :return: 1 if this could be converted, 0 otherwise """ cdef char *c_val cdef PyObject *p_val if PyTuple_CheckExact(key) and PyTuple_GET_SIZE(key) == 1: p_val = PyTuple_GET_ITEM(key, 0) else: # Not a tuple or a PyTuple return 0 if (PyBytes_CheckExact(p_val) and PyBytes_GET_SIZE(p_val) == 45): c_val = PyBytes_AS_STRING(p_val) else: return 0 if strncmp(c_val, b'sha1:', 5) != 0: return 0 if not _unhexlify_sha1(c_val + 5, sha1): return 0 return 1 def _py_key_to_sha1(key): """Map a key to a simple sha1 string. This is a testing thunk to the C function. """ as_bin_sha = PyBytes_FromStringAndSize(NULL, 20) if _key_to_sha1(key, PyBytes_AS_STRING(as_bin_sha)): return as_bin_sha return None cdef tuple _sha1_to_key(char *sha1): """Compute a ('sha1:abcd',) key for a given sha1.""" cdef tuple key cdef object hexxed cdef char *c_buf hexxed = PyBytes_FromStringAndSize(NULL, 45) c_buf = PyBytes_AS_STRING(hexxed) memcpy(c_buf, b'sha1:', 5) _hexlify_sha1(sha1, c_buf+5) key = PyTuple_New(1) Py_INCREF(hexxed) PyTuple_SET_ITEM(key, 0, hexxed) # This is a bit expensive. To parse 120 keys takes 48us, to return them all # can be done in 66.6us (so 18.6us to build them all). # Adding simple hash() here brings it to 76.6us (so computing the hash # value of 120keys is 10us), Intern is 86.9us (another 10us to look and add # them to the intern structure.) # However, since we only intern keys that are in active use, it is probably # a win. Since they would have been read from elsewhere anyway. # We *could* hang the PyObject form off of the gc_chk_sha1_record for ones # that we have deserialized. Something to think about, at least. 
return key def _py_sha1_to_key(sha1_bin): """Test thunk to check the sha1 mapping.""" if not PyBytes_CheckExact(sha1_bin) or PyBytes_GET_SIZE(sha1_bin) != 20: raise ValueError('sha1_bin must be a str of exactly 20 bytes') return _sha1_to_key(PyBytes_AS_STRING(sha1_bin)) cdef unsigned int _sha1_to_uint(char *sha1): # cannot_raise cdef unsigned int val # Must be in MSB, because that is how the content is sorted val = ((((sha1[0]) & 0xff) << 24) | (((sha1[1]) & 0xff) << 16) | (((sha1[2]) & 0xff) << 8) | (((sha1[3]) & 0xff) << 0)) return val cdef _format_record(gc_chk_sha1_record *record): # This is inefficient to go from a logical state back to a bytes object, # but it makes things work a bit better internally for now. if record.block_offset >= 0xFFFFFFFF: # Could use %llu which was added to Python 2.7 but it oddly is missing # from the Python 3 equivalent functions, so hack still needed. :( block_offset_str = b'%d' % record.block_offset value = PyBytes_FromFormat( '%s %u %u %u', PyBytes_AS_STRING(block_offset_str), record.block_length, record.record_start, record.record_end) else: value = PyBytes_FromFormat( '%lu %u %u %u', record.block_offset, record.block_length, record.record_start, record.record_end) return value cdef class GCCHKSHA1LeafNode: """Track all the entries for a given leaf node.""" cdef gc_chk_sha1_record *records cdef public object last_key cdef gc_chk_sha1_record *last_record cdef public int num_records # This is the number of bits to shift to get to the interesting byte. A # value of 24 means that the very first byte changes across all keys. # Anything else means that there is a common prefix of bits that we can # ignore. 0 means that at least the first 3 bytes are identical, though # that is going to be very rare cdef public unsigned char common_shift # This maps an interesting byte to the first record that matches. # Equivalent to bisect.bisect_left(self.records, sha1), though only taking # into account that one byte. 
cdef unsigned char offsets[257] def __sizeof__(self): return ( sizeof(GCCHKSHA1LeafNode) + sizeof(gc_chk_sha1_record) * self.num_records) def __dealloc__(self): if self.records != NULL: PyMem_Free(self.records) self.records = NULL def __init__(self, bytes): self._parse_bytes(bytes) self.last_key = None self.last_record = NULL property min_key: def __get__(self): if self.num_records > 0: return _sha1_to_key(self.records[0].sha1) return None property max_key: def __get__(self): if self.num_records > 0: return _sha1_to_key(self.records[self.num_records-1].sha1) return None cdef tuple _record_to_value_and_refs(self, gc_chk_sha1_record *record): """Extract the refs and value part of this record.""" cdef tuple value_and_refs cdef tuple empty value_and_refs = PyTuple_New(2) value = _format_record(record) Py_INCREF(value) PyTuple_SET_ITEM(value_and_refs, 0, value) # Always empty refs empty = PyTuple_New(0) Py_INCREF(empty) PyTuple_SET_ITEM(value_and_refs, 1, empty) return value_and_refs cdef tuple _record_to_item(self, gc_chk_sha1_record *record): """Turn a given record back into a fully fledged item. """ cdef tuple item cdef tuple key cdef tuple value_and_refs cdef object value key = _sha1_to_key(record.sha1) item = PyTuple_New(2) Py_INCREF(key) PyTuple_SET_ITEM(item, 0, key) value_and_refs = self._record_to_value_and_refs(record) Py_INCREF(value_and_refs) PyTuple_SET_ITEM(item, 1, value_and_refs) return item cdef gc_chk_sha1_record* _lookup_record(self, char *sha1) except? 
NULL: """Find a gc_chk_sha1_record that matches the sha1 supplied.""" cdef int lo, hi, mid, the_cmp cdef int offset # TODO: We can speed up misses by comparing this sha1 to the common # bits, and seeing if the common prefix matches, if not, we don't # need to search for anything because it cannot match # Use the offset array to find the closest fit for this entry # follow that up with bisecting, since multiple keys can be in one # spot # Bisecting dropped us from 7000 comparisons to 582 (4.8/key), using # the offset array dropped us from 23us to 20us and 156 comparisons # (1.3/key) offset = self._offset_for_sha1(sha1) lo = self.offsets[offset] hi = self.offsets[offset+1] if hi == 255: # if hi == 255 that means we potentially ran off the end of the # list, so push it up to num_records # note that if 'lo' == 255, that is ok, because we can start # searching from that part of the list. hi = self.num_records local_n_cmp = 0 while lo < hi: mid = (lo + hi) // 2 the_cmp = memcmp(self.records[mid].sha1, sha1, 20) if the_cmp == 0: return &self.records[mid] elif the_cmp < 0: lo = mid + 1 else: hi = mid return NULL def __contains__(self, key): cdef char sha1[20] cdef gc_chk_sha1_record *record if _key_to_sha1(key, sha1): # If it isn't a sha1 key, then it won't be in this leaf node record = self._lookup_record(sha1) if record != NULL: self.last_key = key self.last_record = record return True return False def __getitem__(self, key): cdef char sha1[20] cdef gc_chk_sha1_record *record record = NULL if self.last_record != NULL and key is self.last_key: record = self.last_record elif _key_to_sha1(key, sha1): record = self._lookup_record(sha1) if record == NULL: raise KeyError('key %r is not present' % (key,)) return self._record_to_value_and_refs(record) def __len__(self): return self.num_records def all_keys(self): cdef int i result = [] for i from 0 <= i < self.num_records: PyList_Append(result, _sha1_to_key(self.records[i].sha1)) return result def all_items(self): cdef int i
result = [] for i from 0 <= i < self.num_records: item = self._record_to_item(&self.records[i]) PyList_Append(result, item) return result cdef int _count_records(self, char *c_content, char *c_end): # cannot_raise """Count how many records are in this section.""" cdef char *c_cur cdef int num_records c_cur = c_content num_records = 0 while c_cur != NULL and c_cur < c_end: c_cur = memchr(c_cur, c'\n', c_end - c_cur); if c_cur == NULL: break c_cur = c_cur + 1 num_records = num_records + 1 return num_records cdef _parse_bytes(self, data): """Parse the bytes 'data' into content.""" cdef char *c_bytes cdef char *c_cur cdef char *c_end cdef Py_ssize_t n_bytes cdef int num_records cdef int entry cdef gc_chk_sha1_record *cur_record if not PyBytes_CheckExact(data): raise TypeError('We only support parsing byte strings.') # Pass 1, count how many records there will be n_bytes = PyBytes_GET_SIZE(data) c_bytes = PyBytes_AS_STRING(data) c_end = c_bytes + n_bytes if strncmp(c_bytes, b'type=leaf\n', 10): raise ValueError("bytes did not start with 'type=leaf\\n': %r" % (data[:10],)) c_cur = c_bytes + 10 num_records = self._count_records(c_cur, c_end) # Now allocate the memory for these items, and go to town self.records = PyMem_Malloc(num_records * (sizeof(unsigned short) + sizeof(gc_chk_sha1_record))) self.num_records = num_records cur_record = self.records entry = 0 while c_cur != NULL and c_cur < c_end and entry < num_records: c_cur = self._parse_one_entry(c_cur, c_end, cur_record) cur_record = cur_record + 1 entry = entry + 1 if (entry != self.num_records or c_cur != c_end or cur_record != self.records + self.num_records): raise ValueError('Something went wrong while parsing.') # Pass 3: build the offset map self._compute_common() cdef char *_parse_one_entry(self, char *c_cur, char *c_end, gc_chk_sha1_record *cur_record) except NULL: """Read a single sha record from the bytes. 
:param c_cur: The pointer to the start of bytes :param c_end: The pointer to the end of bytes :param cur_record: Record to populate """ cdef char *c_next if strncmp(c_cur, 'sha1:', 5): raise ValueError('line did not start with sha1: %r' % (safe_string_from_size(c_cur, 10),)) c_cur = c_cur + 5 c_next = memchr(c_cur, c'\0', c_end - c_cur) if c_next == NULL or (c_next - c_cur != 40): raise ValueError('Line did not contain 40 hex bytes') if not _unhexlify_sha1(c_cur, cur_record.sha1): raise ValueError('We failed to unhexlify') c_cur = c_next + 1 if c_cur[0] != c'\0': raise ValueError('only 1 null, not 2 as expected') c_cur = c_cur + 1 cur_record.block_offset = strtoull(c_cur, &c_next, 10) if c_cur == c_next or c_next[0] != c' ': raise ValueError('Failed to parse block offset') c_cur = c_next + 1 cur_record.block_length = strtoul(c_cur, &c_next, 10) if c_cur == c_next or c_next[0] != c' ': raise ValueError('Failed to parse block length') c_cur = c_next + 1 cur_record.record_start = strtoul(c_cur, &c_next, 10) if c_cur == c_next or c_next[0] != c' ': raise ValueError('Failed to parse record start') c_cur = c_next + 1 cur_record.record_end = strtoul(c_cur, &c_next, 10) if c_cur == c_next or c_next[0] != c'\n': raise ValueError('Failed to parse record end') c_cur = c_next + 1 return c_cur cdef int _offset_for_sha1(self, char *sha1) except -1: """Find the first interesting 8-bits of this sha1.""" cdef int this_offset cdef unsigned int as_uint as_uint = _sha1_to_uint(sha1) this_offset = (as_uint >> self.common_shift) & 0xFF return this_offset def _get_offset_for_sha1(self, sha1): return self._offset_for_sha1(PyBytes_AS_STRING(sha1)) cdef _compute_common(self): cdef unsigned int first cdef unsigned int this cdef unsigned int common_mask cdef unsigned char common_shift cdef int i cdef int offset, this_offset cdef int max_offset # The idea with the offset map is that we should be able to quickly # jump to the key that matches a given sha1.
We know that the keys are # in sorted order, and we know that a lot of the prefix is going to be # the same across them. # By XORing the records together, we can determine which leading bits # are identical in all of them if self.num_records < 2: # Everything is in common if you have 0 or 1 records # So we'll always just shift to the first byte self.common_shift = 24 else: common_mask = 0xFFFFFFFF first = _sha1_to_uint(self.records[0].sha1) for i from 0 < i < self.num_records: this = _sha1_to_uint(self.records[i].sha1) common_mask = (~(first ^ this)) & common_mask common_shift = 24 while common_mask & 0x80000000 and common_shift > 0: common_mask = common_mask << 1 common_shift = common_shift - 1 self.common_shift = common_shift offset = 0 max_offset = self.num_records # We cap this loop at 255 records. All the other offsets just get # filled with 0xff as the sentinel saying 'too many'. # It means that if we have >255 records we have to bisect the second # half of the list, but this is going to be very rare in practice. if max_offset > 255: max_offset = 255 for i from 0 <= i < max_offset: this_offset = self._offset_for_sha1(self.records[i].sha1) while offset <= this_offset: self.offsets[offset] = i offset = offset + 1 while offset < 257: self.offsets[offset] = max_offset offset = offset + 1 def _get_offsets(self): cdef int i result = [] for i from 0 <= i < 257: PyList_Append(result, self.offsets[i]) return result def _parse_into_chk(bytes, key_length, ref_list_length): """Parse into a format optimized for chk records.""" assert key_length == 1 assert ref_list_length == 0 return GCCHKSHA1LeafNode(bytes) def _flatten_node(node, reference_lists): """Convert a node into the serialized form. :param node: A tuple representing a node: (index, key_tuple, value, references) :param reference_lists: Does this index have reference lists?
:return: (string_key, flattened) string_key The serialized key for referencing this node flattened A string with the serialized form for the contents """ cdef int have_reference_lists cdef Py_ssize_t flat_len cdef Py_ssize_t key_len cdef Py_ssize_t node_len cdef char * value cdef Py_ssize_t value_len cdef char * out cdef Py_ssize_t refs_len cdef Py_ssize_t next_len cdef int first_ref_list cdef int first_reference cdef int i cdef Py_ssize_t ref_bit_len if not PyTuple_CheckExact(node): raise TypeError('We expected a tuple() for node not: %s' % type(node)) node_len = len(node) have_reference_lists = reference_lists if have_reference_lists: if node_len != 4: raise ValueError('With ref_lists, we expected 4 entries not: %s' % len(node)) elif node_len < 3: raise ValueError('Without ref_lists, we need at least 3 entries not: %s' % len(node)) # TODO: We can probably do better than string.join(), namely # when key has only 1 item, we can just grab that string # And when there are 2 items, we could do a single malloc + len() + 1 # also, doing .join() requires a PyObject_GetAttrString call, which # we could also avoid. # TODO: Note that pyrex 0.9.6 generates fairly crummy code here, using the # python object interface, versus 0.9.8+ which uses a helper that # checks if this supports the sequence interface. # We *could* do more work on our own, and grab the actual items # lists. For now, just ask people to use a better compiler. :) string_key = b'\0'.join(node[1]) # TODO: instead of using string joins, precompute the final string length, # and then malloc a single string and copy everything in. # TODO: We probably want to use PySequenceFast, because we have lists and # tuples, but we aren't sure which we will get. 
# line := string_key NULL flat_refs NULL value LF # string_key := BYTES (NULL BYTES)* # flat_refs := ref_list (TAB ref_list)* # ref_list := ref (CR ref)* # ref := BYTES (NULL BYTES)* # value := BYTES refs_len = 0 if have_reference_lists: # Figure out how many bytes it will take to store the references ref_lists = node[3] next_len = len(ref_lists) # TODO: use a Py function if next_len > 0: # If there are no nodes, we don't need to do any work # Otherwise we will need (len - 1) '\t' characters to separate # the reference lists refs_len = refs_len + (next_len - 1) for ref_list in ref_lists: next_len = len(ref_list) if next_len > 0: # We will need (len - 1) '\r' characters to separate the # references refs_len = refs_len + (next_len - 1) for reference in ref_list: if not PyTuple_CheckExact(reference): raise TypeError( 'We expect references to be tuples not: %r' % type(reference)) next_len = len(reference) if next_len > 0: # We will need (len - 1) '\x00' characters to # separate the reference key refs_len = refs_len + (next_len - 1) for ref_bit in reference: if not PyBytes_CheckExact(ref_bit): raise TypeError( 'We expect reference bits to be bytes' ' not: %r' % type(ref_bit)) refs_len = refs_len + PyBytes_GET_SIZE(ref_bit) # So we have the (key NULL refs NULL value LF) key_len = PyBytes_Size(string_key) val = node[2] if not PyBytes_CheckExact(val): raise TypeError('Expected bytes for value not: %r' % type(val)) value = PyBytes_AS_STRING(val) value_len = PyBytes_GET_SIZE(val) flat_len = (key_len + 1 + refs_len + 1 + value_len + 1) line = PyBytes_FromStringAndSize(NULL, flat_len) # Get a pointer to the new buffer out = PyBytes_AsString(line) memcpy(out, PyBytes_AsString(string_key), key_len) out = out + key_len out[0] = c'\0' out = out + 1 if refs_len > 0: first_ref_list = 1 for ref_list in ref_lists: if first_ref_list == 0: out[0] = c'\t' out = out + 1 first_ref_list = 0 first_reference = 1 for reference in ref_list: if first_reference == 0: out[0] = c'\r' out = out + 1 
first_reference = 0 next_len = len(reference) for i from 0 <= i < next_len: if i != 0: out[0] = c'\x00' out = out + 1 ref_bit = reference[i] ref_bit_len = PyBytes_GET_SIZE(ref_bit) memcpy(out, PyBytes_AS_STRING(ref_bit), ref_bit_len) out = out + ref_bit_len out[0] = c'\0' out = out + 1 memcpy(out, value, value_len) out = out + value_len out[0] = c'\n' return string_key, line bzrformats_3.4.0.orig/bzrformats/_dirstate_helpers_py.py0000644000000000000000000001313415162073400020631 0ustar00# Copyright (C) 2007, 2008 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Python implementations of Dirstate Helper functions.""" # We cannot import the dirstate module, because it loads this module def _read_dirblocks(state): """Read in the dirblocks for the given DirState object. This is tightly bound to the DirState internal representation. It should be thought of as a member function, which is only separated out so that we can re-write it in pyrex. :param state: A DirState object. :return: None """ from .dirstate import DirState, DirstateCorrupt, _fields_per_entry state._state_file.seek(state._end_of_header) text = state._state_file.read() # TODO: check the crc checksums. 
crc_measured = zlib.crc32(text) fields = text.split(b"\0") # Remove the last blank entry trailing = fields.pop() if trailing != b"": raise DirstateCorrupt(state, f"trailing garbage: {trailing!r}") # consider turning fields into a tuple. # skip the first field which is the trailing null from the header. cur = 1 # Each line now has an extra '\n' field which is not used # so we just skip over it # entry size: # 3 fields for the key # + number of fields per tree_data (5) * tree count # + newline num_present_parents = state._num_present_parents() 1 + num_present_parents entry_size = _fields_per_entry(num_present_parents) expected_field_count = entry_size * state._num_entries field_count = len(fields) # this checks our adjustment, and also catches file too short. if field_count - cur != expected_field_count: raise DirstateCorrupt( state, "field count incorrect {} != {}, entry_size={}, " "num_entries={} fields={!r}".format( field_count - cur, expected_field_count, entry_size, state._num_entries, fields, ), ) if num_present_parents == 1: # Bind external functions to local names _int = int # We access all fields in order, so we can just iterate over # them. Grab a straight iterator over the fields. (We use an # iterator because we don't want to do a lot of additions, nor # do we want to do a lot of slicing) _iter = iter(fields) # Get a local reference to the compatible next method next = getattr(_iter, "__next__", None) if next is None: next = _iter.next # Move the iterator to the current position for _x in range(cur): next() # The two blocks here are deliberate: the root block and the # contents-of-root block. 
state._dirblocks = [(b"", []), (b"", [])] current_block = state._dirblocks[0][1] current_dirname = b"" append_entry = current_block.append for _count in range(state._num_entries): dirname = next() name = next() file_id = next() if dirname != current_dirname: # new block - different dirname current_block = [] current_dirname = dirname state._dirblocks.append((current_dirname, current_block)) append_entry = current_block.append # we know current_dirname == dirname, so re-use it to avoid # creating new strings entry = ( (current_dirname, name, file_id), [ ( # Current Tree next(), # minikind next(), # fingerprint _int(next()), # size next() == b"y", # executable next(), # packed_stat or revision_id ), ( # Parent 1 next(), # minikind next(), # fingerprint _int(next()), # size next() == b"y", # executable next(), # packed_stat or revision_id ), ], ) trailing = next() if trailing != b"\n": raise ValueError(f"trailing garbage in dirstate: {trailing!r}") # append the entry to the current block append_entry(entry) state._split_root_dirblock_into_contents() else: fields_to_entry = state._get_fields_to_entry() entries = [ fields_to_entry(fields[pos : pos + entry_size]) for pos in range(cur, field_count, entry_size) ] state._entries_to_current_state(entries) # To convert from format 2 => format 3 # state._dirblocks = sorted(state._dirblocks, # key=lambda blk:blk[0].split('/')) # To convert from format 3 => format 2 # state._dirblocks = sorted(state._dirblocks) state._dirblock_state = DirState.IN_MEMORY_UNMODIFIED bzrformats_3.4.0.orig/bzrformats/_dirstate_helpers_pyx.pyx0000644000000000000000000023200215162074037021215 0ustar00# Copyright (C) 2007-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # cython: language_level=3 """Helper functions for DirState. This is the python implementation for DirState functions. """ import binascii import bisect import codecs import errno import os import stat import sys from .errors import BzrFormatsError, BadFileKindError from . import osutils from .osutils import (is_inside, is_inside_any, parent_directories, pathjoin, splitpath, file_kind_from_stat_mode, sha_file, _walkdirs_utf8) from .dirstate import DirstateInventoryChange # Delay import to avoid circular dependency DirstateCorrupt = None DirState = None def _ensure_dirstate_import(): global DirstateCorrupt, DirState if DirstateCorrupt is None: from .dirstate import DirstateCorrupt as DC, DirState as DS DirstateCorrupt = DC DirState = DS from cpython.tuple cimport PyTuple_New, PyTuple_SET_ITEM # This is the Windows equivalent of ENOTDIR # It is defined in pywin32.winerror, but we don't want a strong dependency for # just an error code. # XXX: Perhaps we could get it from a windows header ? cdef int ERROR_PATH_NOT_FOUND ERROR_PATH_NOT_FOUND = 3 cdef int ERROR_DIRECTORY ERROR_DIRECTORY = 267 cdef extern from "python-compat.h": unsigned long htonl(unsigned long) # Give Pyrex some function definitions for it to understand. # All of these are just hints to Pyrex, so that it can try to convert python # objects into similar C objects. (such as PyInt => int). # In anything defined 'cdef extern from XXX' the real C header will be # imported, and the real definition will be used from there. 
So these are just # hints, and do not need to match exactly to the C definitions. cdef extern from *: ctypedef unsigned long size_t cdef extern from "stdint.h": ctypedef long intptr_t cdef extern from "stdlib.h": unsigned long int strtoul(char *nptr, char **endptr, int base) cdef extern from 'sys/stat.h': int S_ISDIR(int mode) int S_ISREG(int mode) # On win32, this actually comes from "python-compat.h" int S_ISLNK(int mode) int S_IXUSR # These functions allow us access to a bit of the 'bare metal' of python # objects, rather than going through the object abstraction. (For example, # PyList_Append, rather than getting the 'append' attribute of the object, and # creating a tuple, and then using PyCallObject). # Functions that return (or take) a void* are meant to grab a C PyObject*. This # differs from the Pyrex 'object'. If you declare a variable as 'object' Pyrex # will automatically Py_INCREF and Py_DECREF when appropriate. But for some # inner loops, we don't need to do that at all, as the reference only lasts for # a very short time. # Note that the C API GetItem calls borrow references, so pyrex does the wrong # thing if you declare e.g. object PyList_GetItem(object lst, int index) - you # need to manually Py_INCREF yourself. cdef extern from "Python.h": ctypedef int Py_ssize_t ctypedef struct PyObject: pass int PyList_Append(object lst, object item) except -1 void *PyList_GetItem_object_void "PyList_GET_ITEM" (object lst, int index) void *PyList_GetItem_void_void "PyList_GET_ITEM" (void * lst, int index) object PyList_GET_ITEM(object lst, Py_ssize_t index) int PyList_CheckExact(object) Py_ssize_t PyList_GET_SIZE (object p) void *PyTuple_GetItem_void_void "PyTuple_GET_ITEM" (void* tpl, int index) object PyTuple_GetItem_void_object "PyTuple_GET_ITEM" (void* tpl, int index) object PyTuple_GET_ITEM(object tpl, Py_ssize_t index) unsigned long PyLong_AsUnsignedLongMask(object number) except? 
-1 char *PyBytes_AsString(object p) char *PyBytes_AsString_obj "PyBytes_AsString" (PyObject *string) char *PyBytes_AS_STRING_void "PyBytes_AS_STRING" (void *p) int PyBytes_AsStringAndSize(object str, char **buffer, Py_ssize_t *length) except -1 object PyBytes_FromString(char *) object PyBytes_FromStringAndSize(char *, Py_ssize_t) int PyBytes_Size(object p) int PyBytes_GET_SIZE_void "PyBytes_GET_SIZE" (void *p) int PyBytes_CheckExact(object p) int PyFloat_Check(object p) double PyFloat_AsDouble(object p) int PyLong_Check(object p) void Py_INCREF(object o) void Py_DECREF(object o) cdef extern from "string.h": int strncmp(char *s1, char *s2, int len) void *memchr(void *s, int c, size_t len) int memcmp(void *b1, void *b2, size_t len) from ._str_helpers cimport _my_memrchr, safe_string_from_size cdef int _is_aligned(void *ptr): # cannot_raise """Is this pointer aligned to an integer size offset? :return: 1 if this pointer is aligned, 0 otherwise. """ return ((ptr) & ((sizeof(int))-1)) == 0 cdef int _cmp_by_dirs(char *path1, int size1, char *path2, int size2): # cannot_raise cdef unsigned char *cur1 cdef unsigned char *cur2 cdef unsigned char *end1 cdef unsigned char *end2 cdef int *cur_int1 cdef int *cur_int2 cdef int *end_int1 cdef int *end_int2 if path1 == path2 and size1 == size2: return 0 end1 = path1+size1 end2 = path2+size2 # Use 32-bit comparisons for the matching portion of the string. # Almost all CPU's are faster at loading and comparing 32-bit integers, # than they are at 8-bit integers. 
# 99% of the time, these will be aligned, but in case they aren't just skip # this loop if _is_aligned(path1) and _is_aligned(path2): cur_int1 = path1 cur_int2 = path2 end_int1 = (path1 + size1 - (size1 % sizeof(int))) end_int2 = (path2 + size2 - (size2 % sizeof(int))) while cur_int1 < end_int1 and cur_int2 < end_int2: if cur_int1[0] != cur_int2[0]: break cur_int1 = cur_int1 + 1 cur_int2 = cur_int2 + 1 cur1 = cur_int1 cur2 = cur_int2 else: cur1 = path1 cur2 = path2 while cur1 < end1 and cur2 < end2: if cur1[0] == cur2[0]: # This character matches, just go to the next one cur1 = cur1 + 1 cur2 = cur2 + 1 continue # The current characters do not match if cur1[0] == b'/': return -1 # Reached the end of path1 segment first elif cur2[0] == b'/': return 1 # Reached the end of path2 segment first elif cur1[0] < cur2[0]: return -1 else: return 1 # We reached the end of at least one of the strings if cur1 < end1: return 1 # Not at the end of cur1, must be at the end of cur2 if cur2 < end2: return -1 # At the end of cur1, but not at cur2 # We reached the end of both strings return 0 cdef class Reader: """Maintain the current location, and return fields as you parse them.""" cdef object state # The DirState object cdef object text # The overall string object cdef char *text_cstr # Pointer to the beginning of text cdef int text_size # Length of text cdef char *end_cstr # End of text cdef char *cur_cstr # Pointer to the current record cdef char *next # Pointer to the end of this record def __init__(self, text, state): _ensure_dirstate_import() self.state = state self.text = text self.text_cstr = PyBytes_AsString(text) self.text_size = PyBytes_Size(text) self.end_cstr = self.text_cstr + self.text_size self.cur_cstr = self.text_cstr cdef char *get_next(self, int *size) except NULL: """Return a pointer to the start of the next field.""" cdef char *next cdef Py_ssize_t extra_len if self.cur_cstr == NULL: raise AssertionError('get_next() called when cur_str is NULL') elif 
self.cur_cstr >= self.end_cstr: raise AssertionError('get_next() called when there are no chars' ' left') next = self.cur_cstr self.cur_cstr = memchr(next, b'\0', self.end_cstr - next) if self.cur_cstr == NULL: extra_len = self.end_cstr - next raise DirstateCorrupt(self.state, 'failed to find trailing NULL (\\0).' ' Trailing garbage: %r' % safe_string_from_size(next, extra_len)) size[0] = self.cur_cstr - next self.cur_cstr = self.cur_cstr + 1 return next cdef object get_next_str(self): """Get the next field as a Python string.""" cdef int size cdef char *next next = self.get_next(&size) return safe_string_from_size(next, size) cdef int _init(self) except -1: """Get the pointer ready. This assumes that the dirstate header has already been read, and we already have the dirblock string loaded into memory. This just initializes our memory pointers, etc for parsing of the dirblock string. """ cdef char *first cdef int size # The first field should be an empty string left over from the Header first = self.get_next(&size) if first[0] != b'\0' and size == 0: raise AssertionError('First character should be null not: %s' % (first,)) return 0 cdef object _get_entry(self, int num_trees, void **p_current_dirname, int *new_block): """Extract the next entry. This parses the next entry based on the current location in ``self.cur_cstr``. Each entry can be considered a "row" in the total table. And each row has a fixed number of columns. It is generally broken up into "key" columns, then "current" columns, and then "parent" columns. :param num_trees: How many parent trees need to be parsed :param p_current_dirname: A pointer to the current PyBytes representing the directory name. We pass this in as a void * so that pyrex doesn't have to increment/decrement the PyObject reference counter for each _get_entry call. We use a pointer so that _get_entry can update it with the new value. 
:param new_block: This is to let the caller know that it needs to create a new directory block to store the next entry. """ cdef tuple path_name_file_id_key cdef tuple tmp cdef char *entry_size_cstr cdef unsigned long int entry_size cdef char* executable_cstr cdef int is_executable cdef char* dirname_cstr cdef char* trailing cdef int cur_size cdef int i cdef object minikind cdef object fingerprint cdef object info # Read the 'key' information (dirname, name, file_id) dirname_cstr = self.get_next(&cur_size) # Check to see if we have started a new directory block. # If so, then we need to create a new dirname PyBytes, so that it can # be used in all of the tuples. This saves time and memory, by re-using # the same object repeatedly. # Do the cheap 'length of string' check first. If the string is a # different length, then we *have* to be a different directory. if (cur_size != PyBytes_GET_SIZE_void(p_current_dirname[0]) or strncmp(dirname_cstr, # Extract the char* from our current dirname string. We # know it is a PyBytes, so we can use # PyBytes_AS_STRING, we use the _void version because # we are tricking Pyrex by using a void* rather than an # PyBytes_AS_STRING_void(p_current_dirname[0]), cur_size+1) != 0): dirname = safe_string_from_size(dirname_cstr, cur_size) p_current_dirname[0] = dirname new_block[0] = 1 else: new_block[0] = 0 # Build up the key that will be used. # By using (void *) Pyrex will automatically handle the # Py_INCREF that we need. cur_dirname = p_current_dirname[0] tmp = PyTuple_New(3) Py_INCREF(cur_dirname); PyTuple_SET_ITEM(tmp, 0, cur_dirname) cur_basename = self.get_next_str() cur_file_id = self.get_next_str() Py_INCREF(cur_basename); PyTuple_SET_ITEM(tmp, 1, cur_basename) Py_INCREF(cur_file_id); PyTuple_SET_ITEM(tmp, 2, cur_file_id) path_name_file_id_key = tmp # Parse all of the per-tree information. current has the information in # the same location as parent trees. 
The only difference is that 'info' # is a 'packed_stat' for current, while it is a 'revision_id' for # parent trees. # minikind, fingerprint, and info will be returned as regular python # strings # entry_size and is_executable will be parsed into a python Long and # python Boolean, respectively. # TODO: jam 20070718 Consider changing the entry_size conversion to # prefer python Int when possible. They are generally faster to # work with, and it will be rare that we have a file >2GB. # Especially since this code is pretty much fixed at a max of # 4GB. trees = [] for i from 0 <= i < num_trees: minikind = self.get_next_str() fingerprint = self.get_next_str() entry_size_cstr = self.get_next(&cur_size) entry_size = strtoul(entry_size_cstr, NULL, 10) executable_cstr = self.get_next(&cur_size) is_executable = (executable_cstr[0] == b'y') info = self.get_next_str() PyList_Append(trees, ( minikind, # minikind fingerprint, # fingerprint entry_size, # size is_executable,# executable info, # packed_stat or revision_id )) # The returned tuple is (key, [trees]) ret = (path_name_file_id_key, trees) # Ignore the trailing newline, but assert that it does exist, this # ensures that we always finish parsing a line on an end-of-entry # marker. 
trailing = self.get_next(&cur_size) if cur_size != 1 or not trailing.startswith(b'\n'): raise DirstateCorrupt(self.state, 'Bad parse, we expected to end on \\n, not: %d %s: %s' % (cur_size, safe_string_from_size(trailing, cur_size), ret)) return ret def _parse_dirblocks(self): """Parse all dirblocks in the state file.""" cdef int num_trees cdef object current_block cdef object entry cdef void * current_dirname cdef int new_block cdef int expected_entry_count cdef int entry_count num_trees = self.state._num_present_parents() + 1 expected_entry_count = self.state._num_entries # Ignore the first record self._init() current_block = [] dirblocks = [(b'', current_block), (b'', [])] self.state._dirblocks = dirblocks obj = b'' current_dirname = obj new_block = 0 entry_count = 0 # TODO: jam 2007-05-07 Consider pre-allocating some space for the # members, and then growing and shrinking from there. If most # directories have close to 10 entries in them, it would save a # few mallocs if we default our list size to something # reasonable. Or we could malloc it to something large (100 or # so), and then truncate. That would give us a malloc + realloc, # rather than lots of reallocs. while self.cur_cstr < self.end_cstr: entry = self._get_entry(num_trees, &current_dirname, &new_block) if new_block: # new block - different dirname current_block = [] PyList_Append(dirblocks, (current_dirname, current_block)) PyList_Append(current_block, entry) entry_count = entry_count + 1 if entry_count != expected_entry_count: raise DirstateCorrupt(self.state, 'We read the wrong number of entries.' ' We expected to read %s, but read %s' % (expected_entry_count, entry_count)) self.state._split_root_dirblock_into_contents() def _read_dirblocks(state): """Read in the dirblocks for the given DirState object. This is tightly bound to the DirState internal representation. It should be thought of as a member function, which is only separated out so that we can re-write it in pyrex. 
:param state: A DirState object. :return: None :postcondition: The dirblocks will be loaded into the appropriate fields in the DirState object. """ state._state_file.seek(state._end_of_header) text = state._state_file.read() # TODO: check the crc checksums. crc_measured = zlib.crc32(text) reader = Reader(text, state) reader._parse_dirblocks() state._dirblock_state = DirState.IN_MEMORY_UNMODIFIED cdef int minikind_from_mode(int mode): # cannot_raise # in order of frequency: if S_ISREG(mode): return c"f" if S_ISDIR(mode): return c"d" if S_ISLNK(mode): return c"l" return 0 _encode = binascii.b2a_base64 cdef unsigned long _time_to_unsigned(object t): # cannot_raise if PyFloat_Check(t): t = t.__int__() return PyLong_AsUnsignedLongMask(t) cdef _pack_stat(stat_value): """return a string representing the stat value's key fields. :param stat_value: A stat object with st_size, st_mtime, st_ctime, st_dev, st_ino and st_mode fields. """ cdef char result[6*4] # 6 long ints cdef int *aliased aliased = result aliased[0] = htonl(PyLong_AsUnsignedLongMask(stat_value.st_size)) # mtime and ctime will often be floats but get converted to PyInt within aliased[1] = htonl(_time_to_unsigned(stat_value.st_mtime)) aliased[2] = htonl(_time_to_unsigned(stat_value.st_ctime)) aliased[3] = htonl(PyLong_AsUnsignedLongMask(stat_value.st_dev)) aliased[4] = htonl(PyLong_AsUnsignedLongMask(stat_value.st_ino)) aliased[5] = htonl(PyLong_AsUnsignedLongMask(stat_value.st_mode)) packed = PyBytes_FromStringAndSize(result, 6*4) return _encode(packed)[:-1] cpdef update_entry(self, entry, abspath, stat_value): """Update the entry based on what is actually on disk. This function only calculates the sha if it needs to - if the entry is uncachable, or clearly different to the first parent's entry, no sha is calculated, and None is returned. :param self: The dirstate object this is operating on. :param entry: This is the dirblock entry for the file in question. :param abspath: The path on disk for this file. 
:param stat_value: The stat value done on the path. :return: None, or The sha1 hexdigest of the file (40 bytes) or link target of a symlink. """ # TODO - require pyrex 0.9.8, then use a pyd file to define access to the # _st mode of the compiled stat objects. cdef int minikind, saved_minikind cdef void * details cdef int worth_saving minikind = minikind_from_mode(stat_value.st_mode) if 0 == minikind: return None packed_stat = _pack_stat(stat_value) details = PyList_GetItem_void_void(PyTuple_GetItem_void_void(entry, 1), 0) saved_minikind = PyBytes_AsString_obj(PyTuple_GetItem_void_void(details, 0))[0] if minikind == b'd' and saved_minikind == b't': minikind = b't' saved_link_or_sha1 = PyTuple_GetItem_void_object(details, 1) saved_file_size = PyTuple_GetItem_void_object(details, 2) saved_executable = PyTuple_GetItem_void_object(details, 3) saved_packed_stat = PyTuple_GetItem_void_object(details, 4) # Deal with pyrex decrefing the objects Py_INCREF(saved_link_or_sha1) Py_INCREF(saved_file_size) Py_INCREF(saved_executable) Py_INCREF(saved_packed_stat) #(saved_minikind, saved_link_or_sha1, saved_file_size, # saved_executable, saved_packed_stat) = entry[1][0] if (minikind == saved_minikind and packed_stat == saved_packed_stat): # The stat hasn't changed since we saved, so we can re-use the # saved sha hash. if minikind == b'd': return None # size should also be in packed_stat if saved_file_size == stat_value.st_size: return saved_link_or_sha1 # If we have gotten this far, that means that we need to actually # process this entry. link_or_sha1 = None worth_saving = 1 if minikind == b'f': executable = self._is_executable(stat_value.st_mode, saved_executable) if self._cutoff_time is None: self._sha_cutoff_time() if (stat_value.st_mtime < self._cutoff_time and stat_value.st_ctime < self._cutoff_time and len(entry[1]) > 1 and entry[1][1][0] != b'a'): # Could check for size changes for further optimised # avoidance of sha1's. 
However the most prominent case of # over-shaing is during initial add, which this catches. link_or_sha1 = self._sha1_file(abspath) entry[1][0] = (b'f', link_or_sha1, stat_value.st_size, executable, packed_stat) else: # This file is not worth caching the sha1. Either it is too new, or # it is newly added. Regardless, the only things we are changing # are derived from the stat, and so are not worth caching. So we do # *not* set the IN_MEMORY_MODIFIED flag. (But we'll save the # updated values if there is *other* data worth saving.) entry[1][0] = (b'f', b'', stat_value.st_size, executable, DirState.NULLSTAT) worth_saving = 0 elif minikind == b'd': entry[1][0] = (b'd', b'', 0, False, packed_stat) if saved_minikind != b'd': # This changed from something into a directory. Make sure we # have a directory block for it. This doesn't happen very # often, so this doesn't have to be super fast. block_index, entry_index, dir_present, file_present = \ self._get_block_entry_index(entry[0][0], entry[0][1], 0) self._ensure_block(block_index, entry_index, pathjoin(entry[0][0], entry[0][1])) else: # Any changes are derived trivially from the stat object, not worth # re-writing a dirstate for just this worth_saving = 0 elif minikind == b'l': if saved_minikind == b'l': # If the object hasn't changed kind, it isn't worth saving the # dirstate just for a symlink. The default is 'fast symlinks' which # save the target in the inode entry, rather than separately. So to # stat, we've already read everything off disk. 
worth_saving = 0 link_or_sha1 = self._read_link(abspath, saved_link_or_sha1) if self._cutoff_time is None: self._sha_cutoff_time() if (stat_value.st_mtime < self._cutoff_time and stat_value.st_ctime < self._cutoff_time): entry[1][0] = (b'l', link_or_sha1, stat_value.st_size, False, packed_stat) else: entry[1][0] = (b'l', b'', stat_value.st_size, False, DirState.NULLSTAT) if worth_saving: # Note, even though _mark_modified will only set # IN_MEMORY_HASH_MODIFIED, it still isn't worth self._mark_modified([entry]) return link_or_sha1 # TODO: Do we want to worry about exceptions here? cdef char _minikind_from_string(object string) except? -1: """Convert a python string to a char.""" return PyBytes_AsString(string)[0] cdef object _kind_absent cdef object _kind_file cdef object _kind_directory cdef object _kind_symlink cdef object _kind_relocated cdef object _kind_tree_reference _kind_absent = "absent" _kind_file = "file" _kind_directory = "directory" _kind_symlink = "symlink" _kind_relocated = "relocated" _kind_tree_reference = "tree-reference" cdef object _minikind_to_kind(char minikind): """Create a string kind for minikind.""" cdef char _minikind[1] if minikind == b'f': return _kind_file elif minikind == b'd': return _kind_directory elif minikind == b'a': return _kind_absent elif minikind == b'r': return _kind_relocated elif minikind == b'l': return _kind_symlink elif minikind == b't': return _kind_tree_reference _minikind[0] = minikind raise KeyError(PyBytes_FromStringAndSize(_minikind, 1)) cdef int _versioned_minikind(char minikind): # cannot_raise """Return non-zero if minikind is in fltd""" return (minikind == b'f' or minikind == b'd' or minikind == b'l' or minikind == b't') cdef utf8_decode(path: bytes): return codecs.utf_8_decode(path, 'surrogateescape')[0] cdef class ProcessEntryC: cdef int doing_consistency_expansion cdef object old_dirname_to_file_id # dict cdef object new_dirname_to_file_id # dict cdef object last_source_parent cdef object 
last_target_parent cdef int include_unchanged cdef int partial cdef object use_filesystem_for_exec cdef object utf8_decode cdef readonly object searched_specific_files cdef readonly object searched_exact_paths cdef object search_specific_files # The parents up to the root of the paths we are searching. # After all normal paths are returned, these specific items are returned. cdef object search_specific_file_parents cdef object state # Current iteration variables: cdef object current_root cdef object current_root_unicode cdef object root_entries cdef int root_entries_pos, root_entries_len cdef object root_abspath cdef int source_index, target_index cdef int want_unversioned cdef object tree cdef object dir_iterator cdef int block_index cdef object current_block cdef int current_block_pos cdef object current_block_list cdef object current_dir_info cdef object current_dir_list cdef object _pending_consistent_entries # list cdef int path_index cdef object root_dir_info cdef object bisect_left cdef object pathjoin cdef object fstat # A set of the ids we've output when doing partial output. cdef object seen_ids cdef object sha_file def __init__(self, include_unchanged, use_filesystem_for_exec, search_specific_files, state, source_index, target_index, want_unversioned, tree): self.doing_consistency_expansion = 0 self.old_dirname_to_file_id = {} self.new_dirname_to_file_id = {} # Are we doing a partial iter_changes? self.partial = set(['']).__ne__(search_specific_files) # Using a list so that we can access the values and change them in # nested scope. 
Each one is [path, file_id, entry] self.last_source_parent = [None, None] self.last_target_parent = [None, None] if include_unchanged is None: self.include_unchanged = False else: self.include_unchanged = int(include_unchanged) self.use_filesystem_for_exec = use_filesystem_for_exec # for all search_indexs in each path at or under each element of # search_specific_files, if the detail is relocated: add the id, and # add the relocated path as one to search if its not searched already. # If the detail is not relocated, add the id. self.searched_specific_files = set() # When we search exact paths without expanding downwards, we record # that here. self.searched_exact_paths = set() self.search_specific_files = search_specific_files # The parents up to the root of the paths we are searching. # After all normal paths are returned, these specific items are returned. self.search_specific_file_parents = set() # The ids we've sent out in the delta. self.seen_ids = set() self.state = state self.current_root = None self.current_root_unicode = None self.root_entries = None self.root_entries_pos = 0 self.root_entries_len = 0 self.root_abspath = None if source_index is None: self.source_index = -1 else: self.source_index = source_index self.target_index = target_index self.want_unversioned = want_unversioned self.tree = tree self.dir_iterator = None self.block_index = -1 self.current_block = None self.current_block_list = None self.current_block_pos = -1 self.current_dir_info = None self.current_dir_list = None self._pending_consistent_entries = [] self.path_index = 0 self.root_dir_info = None self.bisect_left = bisect.bisect_left self.pathjoin = pathjoin self.fstat = os.fstat self.sha_file = sha_file if target_index != 0: # A lot of code in here depends on target_index == 0 raise BzrFormatsError('unsupported target index') cdef _process_entry(self, entry, path_info): """Compare an entry and real disk to generate delta information. 
:param path_info: top_relpath, basename, kind, lstat, abspath for the path of entry. If None, then the path is considered absent in the target (Perhaps we should pass in a concrete entry for this ?) Basename is returned as a utf8 string because we expect this tuple will be ignored, and don't want to take the time to decode. :return: (iter_changes_result, changed). If the entry has not been handled then changed is None. Otherwise it is False if no content or metadata changes have occurred, and True if any content or metadata change has occurred. If self.include_unchanged is True then if changed is not None, iter_changes_result will always be a result tuple. Otherwise, iter_changes_result is None unless changed is True. """ cdef char target_minikind cdef char source_minikind cdef object file_id cdef int content_change cdef object details_list file_id = None details_list = entry[1] if -1 == self.source_index: source_details = DirState.NULL_PARENT_DETAILS else: source_details = details_list[self.source_index] target_details = details_list[self.target_index] target_minikind = _minikind_from_string(target_details[0]) if path_info is not None and _versioned_minikind(target_minikind): if self.target_index != 0: raise AssertionError("Unsupported target index %d" % self.target_index) link_or_sha1 = update_entry(self.state, entry, path_info[4], path_info[3]) # The entry may have been modified by update_entry target_details = details_list[self.target_index] target_minikind = _minikind_from_string(target_details[0]) else: link_or_sha1 = None # the rest of this function is 0.3 seconds on 50K paths, or # 0.000006 seconds per call.
source_minikind = _minikind_from_string(source_details[0]) if ((_versioned_minikind(source_minikind) or source_minikind == b'r') and _versioned_minikind(target_minikind)): # claimed content in both: diff # r | fdlt | | add source to search, add id path move and perform # | | | diff check on source-target # r | fdlt | a | dangling file that was present in the basis. # | | | ??? if source_minikind != b'r': old_dirname = entry[0][0] old_basename = entry[0][1] old_path = path = None else: # add the source to the search path to find any children it # has. TODO ? : only add if it is a container ? if (not self.doing_consistency_expansion and not is_inside_any(self.searched_specific_files, source_details[1])): self.search_specific_files.add(source_details[1]) # expanding from a user requested path, parent expansion # for delta consistency happens later. # generate the old path; this is needed for stating later # as well. old_path = source_details[1] old_dirname, old_basename = os.path.split(old_path) path = self.pathjoin(entry[0][0], entry[0][1]) old_entry = self.state._get_entry(self.source_index, path_utf8=old_path) # update the source details variable to be the real # location. if old_entry == (None, None): raise DirstateCorrupt(self.state._filename, "entry '%s/%s' is considered renamed from %r" " but source does not exist\n" "entry: %s" % (entry[0][0], entry[0][1], old_path, entry)) source_details = old_entry[1][self.source_index] source_minikind = _minikind_from_string(source_details[0]) if path_info is None: # the file is missing on disk, show as removed. content_change = 1 target_kind = None target_exec = False else: # source and target are both versioned and disk file is present. 
target_kind = path_info[2] if target_kind == 'directory': if path is None: old_path = path = self.pathjoin(old_dirname, old_basename) file_id = entry[0][2] self.new_dirname_to_file_id[path] = file_id if source_minikind != b'd': content_change = 1 else: # directories have no fingerprint content_change = 0 target_exec = False elif target_kind == 'file': if source_minikind != b'f': content_change = 1 else: # Check the sha. We can't just rely on the size as # content filtering may mean differ sizes actually # map to the same content if link_or_sha1 is None: # Stat cache miss: statvalue, link_or_sha1 = \ self.state._sha1_provider.stat_and_sha1( path_info[4]) self.state._observed_sha1(entry, link_or_sha1, statvalue) content_change = (link_or_sha1 != source_details[1]) # Target details is updated at update_entry time if self.use_filesystem_for_exec: # We don't need S_ISREG here, because we are sure # we are dealing with a file. target_exec = bool(S_IXUSR & path_info[3].st_mode) else: target_exec = target_details[3] elif target_kind == 'symlink': if source_minikind != b'l': content_change = 1 else: content_change = (link_or_sha1 != source_details[1]) target_exec = False elif target_kind == 'tree-reference': if source_minikind != b't': content_change = 1 else: content_change = 0 target_exec = False else: if path is None: path = self.pathjoin(old_dirname, old_basename) raise BadFileKindError(path, path_info[2]) if source_minikind == b'd': if path is None: old_path = path = self.pathjoin(old_dirname, old_basename) if file_id is None: file_id = entry[0][2] self.old_dirname_to_file_id[old_path] = file_id # parent id is the entry for the path in the target tree if old_basename and old_dirname == self.last_source_parent[0]: # use a cached hit for non-root source entries. 
source_parent_id = self.last_source_parent[1] else: try: source_parent_id = self.old_dirname_to_file_id[old_dirname] except KeyError, _: source_parent_entry = self.state._get_entry(self.source_index, path_utf8=old_dirname) source_parent_id = source_parent_entry[0][2] if source_parent_id == entry[0][2]: # This is the root, so the parent is None source_parent_id = None else: self.last_source_parent[0] = old_dirname self.last_source_parent[1] = source_parent_id new_dirname = entry[0][0] if entry[0][1] and new_dirname == self.last_target_parent[0]: # use a cached hit for non-root target entries. target_parent_id = self.last_target_parent[1] else: try: target_parent_id = self.new_dirname_to_file_id[new_dirname] except KeyError, _: # TODO: We don't always need to do the lookup, because the # parent entry will be the same as the source entry. target_parent_entry = self.state._get_entry(self.target_index, path_utf8=new_dirname) if target_parent_entry == (None, None): raise AssertionError( "Could not find target parent in wt: %s\nparent of: %s" % (new_dirname, entry)) target_parent_id = target_parent_entry[0][2] if target_parent_id == entry[0][2]: # This is the root, so the parent is None target_parent_id = None else: self.last_target_parent[0] = new_dirname self.last_target_parent[1] = target_parent_id source_exec = source_details[3] changed = (content_change or source_parent_id != target_parent_id or old_basename != entry[0][1] or source_exec != target_exec ) if not changed and not self.include_unchanged: return None, False else: if old_path is None: path = self.pathjoin(old_dirname, old_basename) old_path = path old_path_u = utf8_decode(old_path) path_u = old_path_u else: old_path_u = utf8_decode(old_path) if old_path == path: path_u = old_path_u else: path_u = utf8_decode(path) source_kind = _minikind_to_kind(source_minikind) return DirstateInventoryChange(entry[0][2], (old_path_u, path_u), content_change, (True, True), (source_parent_id, target_parent_id), 
(utf8_decode(old_basename), utf8_decode(entry[0][1])), (source_kind, target_kind), (source_exec, target_exec)), changed elif source_minikind == b'a' and _versioned_minikind(target_minikind): # looks like a new file path = self.pathjoin(entry[0][0], entry[0][1]) # parent id is the entry for the path in the target tree # TODO: these are the same for an entire directory: cache em. parent_entry = self.state._get_entry(self.target_index, path_utf8=entry[0][0]) if parent_entry is None: raise DirstateCorrupt(self.state, "We could not find the parent entry in index %d" " for the entry: %s" % (self.target_index, entry[0])) parent_id = parent_entry[0][2] if parent_id == entry[0][2]: parent_id = None if path_info is not None: # Present on disk: if self.use_filesystem_for_exec: # We need S_ISREG here, because we aren't sure if this # is a file or not. target_exec = bool( S_ISREG(path_info[3].st_mode) and S_IXUSR & path_info[3].st_mode) else: target_exec = target_details[3] return DirstateInventoryChange(entry[0][2], (None, utf8_decode(path)), True, (False, True), (None, parent_id), (None, utf8_decode(entry[0][1])), (None, path_info[2]), (None, target_exec)), True else: # Its a missing file, report it as such. return DirstateInventoryChange(entry[0][2], (None, utf8_decode(path)), False, (False, True), (None, parent_id), (None, utf8_decode(entry[0][1])), (None, None), (None, False)), True elif _versioned_minikind(source_minikind) and target_minikind == b'a': # unversioned, possibly, or possibly not deleted: we dont care. # if its still on disk, *and* theres no other entry at this # path [we dont know this in this routine at the moment - # perhaps we should change this - then it would be an unknown. 
old_path = self.pathjoin(entry[0][0], entry[0][1]) # parent id is the entry for the path in the target tree parent_id = self.state._get_entry(self.source_index, path_utf8=entry[0][0])[0][2] if parent_id == entry[0][2]: parent_id = None return DirstateInventoryChange( entry[0][2], (utf8_decode(old_path), None), True, (True, False), (parent_id, None), (utf8_decode(entry[0][1]), None), (_minikind_to_kind(source_minikind), None), (source_details[3], None)), True elif _versioned_minikind(source_minikind) and target_minikind == b'r': # a rename; could be a true rename, or a rename inherited from # a renamed parent. TODO: handle this efficiently. Its not # common case to rename dirs though, so a correct but slow # implementation will do. if (not self.doing_consistency_expansion and not is_inside_any(self.searched_specific_files, target_details[1])): self.search_specific_files.add(target_details[1]) # We don't expand the specific files parents list here as # the path is absent in target and won't create a delta with # missing parent. elif ((source_minikind == b'r' or source_minikind == b'a') and (target_minikind == b'r' or target_minikind == b'a')): # neither of the selected trees contain this path, # so skip over it. This is not currently directly tested, but # is indirectly via test_too_much.TestCommands.test_conflicts. pass else: raise AssertionError("don't know how to compare " "source_minikind=%r, target_minikind=%r" % (source_minikind, target_minikind)) ## import pdb;pdb.set_trace() return None, None def __iter__(self): return self def iter_changes(self): return self cdef int _gather_result_for_consistency(self, result) except -1: """Check a result we will yield to make sure we are consistent later. This gathers result's parents into a set to output later. :param result: A result tuple. 
""" if not self.partial or not result.file_id: return 0 self.seen_ids.add(result.file_id) new_path = result.path[1] if new_path: # Not the root and not a delete: queue up the parents of the path. self.search_specific_file_parents.update( [p.encode('utf-8') for p in parent_directories(new_path)]) # Add the root directory which parent_directories does not # provide. self.search_specific_file_parents.add(b'') return 0 cdef int _update_current_block(self) except -1: if (self.block_index < len(self.state._dirblocks) and is_inside(self.current_root, self.state._dirblocks[self.block_index][0])): self.current_block = self.state._dirblocks[self.block_index] self.current_block_list = self.current_block[1] self.current_block_pos = 0 else: self.current_block = None self.current_block_list = None return 0 def __next__(self): # Simple thunk to allow tail recursion without pyrex confusion return self._iter_next() cdef _iter_next(self): """Iterate over the changes.""" # This function single steps through an iterator. As such while loops # are often exited by 'return' - the code is structured so that the # next call into the function will return to the same while loop. Note # that all flow control needed to re-reach that step is reexecuted, # which can be a performance problem. It has not yet been tuned to # minimise this; a state machine is probably the simplest restructuring # to both minimise this overhead and make the code considerably more # understandable. # sketch: # compare source_index and target_index at or under each element of search_specific_files. # follow the following comparison table. Note that we only want to do diff operations when # the target is fdl because thats when the walkdirs logic will have exposed the pathinfo # for the target. # cases: # # Source | Target | disk | action # r | fdlt | | add source to search, add id path move and perform # | | | diff check on source-target # r | fdlt | a | dangling file that was present in the basis. # | | | ??? 
# r | a | | add source to search # r | a | a | # r | r | | this path is present in a non-examined tree, skip. # r | r | a | this path is present in a non-examined tree, skip. # a | fdlt | | add new id # a | fdlt | a | dangling locally added file, skip # a | a | | not present in either tree, skip # a | a | a | not present in any tree, skip # a | r | | not present in either tree at this path, skip as it # | | | may not be selected by the users list of paths. # a | r | a | not present in either tree at this path, skip as it # | | | may not be selected by the users list of paths. # fdlt | fdlt | | content in both: diff them # fdlt | fdlt | a | deleted locally, but not unversioned - show as deleted ? # fdlt | a | | unversioned: output deleted id for now # fdlt | a | a | unversioned and deleted: output deleted id # fdlt | r | | relocated in this tree, so add target to search. # | | | Dont diff, we will see an r,fd; pair when we reach # | | | this id at the other path. # fdlt | r | a | relocated in this tree, so add target to search. # | | | Dont diff, we will see an r,fd; pair when we reach # | | | this id at the other path. # TODO: jam 20070516 - Avoid the _get_entry lookup overhead by # keeping a cache of directories that we have seen. cdef object current_dirname, current_blockname cdef char * current_dirname_c cdef char * current_blockname_c cdef int advance_entry, advance_path cdef int path_handled searched_specific_files = self.searched_specific_files # Are we walking a root? while self.root_entries_pos < self.root_entries_len: entry = self.root_entries[self.root_entries_pos] self.root_entries_pos = self.root_entries_pos + 1 result, changed = self._process_entry(entry, self.root_dir_info) if changed is not None: if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: return result # Have we finished the prior root, or never started one ? if self.current_root is None: # TODO: the pending list should be lexically sorted? 
the # interface doesn't require it. try: self.current_root = self.search_specific_files.pop() except KeyError, _: raise StopIteration() self.searched_specific_files.add(self.current_root) # process the entries for this containing directory: the rest will be # found by their parents recursively. self.root_entries = self.state._entries_for_path(self.current_root) self.root_entries_len = len(self.root_entries) self.current_root_unicode = self.current_root.decode('utf8') self.root_abspath = self.tree.abspath(self.current_root_unicode) try: root_stat = os.lstat(self.root_abspath) except OSError, e: if e.errno == errno.ENOENT: # the path does not exist: let _process_entry know that. self.root_dir_info = None else: # some other random error: hand it up. raise else: self.root_dir_info = (b'', self.current_root, file_kind_from_stat_mode(root_stat.st_mode), root_stat, self.root_abspath) if self.root_dir_info[2] == 'directory': if self.tree._directory_is_tree_reference( self.current_root_unicode): self.root_dir_info = self.root_dir_info[:2] + \ ('tree-reference',) + self.root_dir_info[3:] if not self.root_entries and not self.root_dir_info: # this specified path is not present at all, skip it. # (tail recursion, can do a loop once the full structure is # known). return self._iter_next() path_handled = 0 self.root_entries_pos = 0 # XXX Clarity: This loop is duplicated a out the self.current_root # is None guard above: if we return from it, it completes there # (and the following if block cannot trigger because # path_handled must be true, so the if block is not # duplicated. 
while self.root_entries_pos < self.root_entries_len: entry = self.root_entries[self.root_entries_pos] self.root_entries_pos = self.root_entries_pos + 1 result, changed = self._process_entry(entry, self.root_dir_info) if changed is not None: path_handled = -1 if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: return result # handle unversioned specified paths: if self.want_unversioned and not path_handled and self.root_dir_info: new_executable = bool( stat.S_ISREG(self.root_dir_info[3].st_mode) and stat.S_IEXEC & self.root_dir_info[3].st_mode) return DirstateInventoryChange( None, (None, self.current_root_unicode), True, (False, False), (None, None), (None, splitpath(self.current_root_unicode)[-1]), (None, self.root_dir_info[2]), (None, new_executable) ) # If we reach here, the outer flow continues, which enters into the # per-root setup logic. if (self.current_dir_info is None and self.current_block is None and not self.doing_consistency_expansion): # setup iteration of this root: self.current_dir_list = None if self.root_dir_info and self.root_dir_info[2] == 'tree-reference': self.current_dir_info = None else: self.dir_iterator = _walkdirs_utf8(self.root_abspath, prefix=self.current_root) self.path_index = 0 try: self.current_dir_info = next(self.dir_iterator) self.current_dir_list = self.current_dir_info[1] except OSError, e: # there may be directories in the inventory even though # this path is not a file on disk: so mark it as end of # iterator if e.errno in (errno.ENOENT, errno.ENOTDIR, errno.EINVAL): self.current_dir_info = None elif sys.platform == 'win32': # on win32, python2.4 has e.errno == ERROR_DIRECTORY, but # python 2.5 has e.errno == EINVAL, # and e.winerror == ERROR_DIRECTORY try: e_winerror = e.winerror except AttributeError, _: e_winerror = None win_errors = (ERROR_DIRECTORY, ERROR_PATH_NOT_FOUND) if (e.errno in win_errors or e_winerror in win_errors): self.current_dir_info = None else: # Will this really 
raise the right exception ? raise else: raise else: if self.current_dir_info[0][0] == b'': # remove .bzr from iteration bzr_index = self.bisect_left(self.current_dir_list, (b'.bzr',)) if self.current_dir_list[bzr_index][0] != b'.bzr': raise AssertionError() del self.current_dir_list[bzr_index] initial_key = (self.current_root, b'', b'') self.block_index, _ = self.state._find_block_index_from_key(initial_key) if self.block_index == 0: # we have processed the total root already, but because the # initial key matched it we should skip it here. self.block_index = self.block_index + 1 self._update_current_block() # walk until both the directory listing and the versioned metadata # are exhausted. while (self.current_dir_info is not None or self.current_block is not None): # Uncommon case - a missing directory or an unversioned directory: if (self.current_dir_info and self.current_block and self.current_dir_info[0][0] != self.current_block[0]): # Work around pyrex broken heuristic - current_dirname has # the same scope as current_dirname_c current_dirname = self.current_dir_info[0][0] current_dirname_c = PyBytes_AS_STRING_void( current_dirname) current_blockname = self.current_block[0] current_blockname_c = PyBytes_AS_STRING_void( current_blockname) # In the python generator we evaluate this if block once per # dir+block; because we reenter in the pyrex version its being # evaluated once per path: we could cache the result before # doing the while loop and probably save time. if _cmp_by_dirs(current_dirname_c, PyBytes_Size(current_dirname), current_blockname_c, PyBytes_Size(current_blockname)) < 0: # filesystem data refers to paths not covered by the # dirblock. this has two possibilities: # A) it is versioned but empty, so there is no block for it # B) it is not versioned. # if (A) then we need to recurse into it to check for # new unknown files or directories. # if (B) then we should ignore it, because we don't # recurse into unknown directories. 
# We are doing a loop while self.path_index < len(self.current_dir_list): current_path_info = self.current_dir_list[self.path_index] # dont descend into this unversioned path if it is # a dir if current_path_info[2] in ('directory', 'tree-reference'): del self.current_dir_list[self.path_index] self.path_index = self.path_index - 1 self.path_index = self.path_index + 1 if self.want_unversioned: if current_path_info[2] == 'directory': if self.tree._directory_is_tree_reference( utf8_decode(current_path_info[0])): current_path_info = current_path_info[:2] + \ ('tree-reference',) + current_path_info[3:] new_executable = bool( stat.S_ISREG(current_path_info[3].st_mode) and stat.S_IEXEC & current_path_info[3].st_mode) return DirstateInventoryChange( None, (None, utf8_decode(current_path_info[0])), True, (False, False), (None, None), (None, utf8_decode(current_path_info[1])), (None, current_path_info[2]), (None, new_executable)) # This dir info has been handled, go to the next self.path_index = 0 self.current_dir_list = None try: self.current_dir_info = next(self.dir_iterator) self.current_dir_list = self.current_dir_info[1] except StopIteration, _: self.current_dir_info = None else: #(dircmp > 0) # We have a dirblock entry for this location, but there # is no filesystem path for this. This is most likely # because a directory was removed from the disk. # We don't have to report the missing directory, # because that should have already been handled, but we # need to handle all of the files that are contained # within. while self.current_block_pos < len(self.current_block_list): current_entry = self.current_block_list[self.current_block_pos] self.current_block_pos = self.current_block_pos + 1 # entry referring to file not present on disk. # advance the entry only, after processing. 
result, changed = self._process_entry(current_entry, None) if changed is not None: if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: return result self.block_index = self.block_index + 1 self._update_current_block() continue # next loop-on-block/dir result = self._loop_one_block() if result is not None: return result if len(self.search_specific_files): # More supplied paths to process self.current_root = None return self._iter_next() # Start expanding more conservatively, adding paths the user may not # have intended but required for consistent deltas. self.doing_consistency_expansion = 1 if not self._pending_consistent_entries: self._pending_consistent_entries = self._next_consistent_entries() while self._pending_consistent_entries: result, changed = self._pending_consistent_entries.pop() if changed is not None: return result raise StopIteration() cdef object _maybe_tree_ref(self, current_path_info): if self.tree._directory_is_tree_reference( utf8_decode(current_path_info[0])): return current_path_info[:2] + \ ('tree-reference',) + current_path_info[3:] else: return current_path_info cdef object _loop_one_block(self): # current_dir_info and current_block refer to the same directory - # this is the common case code. 
# Assign local variables for current path and entry: cdef object current_entry cdef object current_path_info cdef int path_handled cdef char minikind cdef int cmp_result # cdef char * temp_str # cdef Py_ssize_t temp_str_length # PyBytes_AsStringAndSize(disk_kind, &temp_str, &temp_str_length) # if not strncmp(temp_str, "directory", temp_str_length): if (self.current_block is not None and self.current_block_pos < PyList_GET_SIZE(self.current_block_list)): current_entry = PyList_GET_ITEM(self.current_block_list, self.current_block_pos) # accomodate pyrex Py_INCREF(current_entry) else: current_entry = None if (self.current_dir_info is not None and self.path_index < PyList_GET_SIZE(self.current_dir_list)): current_path_info = PyList_GET_ITEM(self.current_dir_list, self.path_index) # accomodate pyrex Py_INCREF(current_path_info) disk_kind = PyTuple_GET_ITEM(current_path_info, 2) # accomodate pyrex Py_INCREF(disk_kind) if disk_kind == "directory": current_path_info = self._maybe_tree_ref(current_path_info) else: current_path_info = None while (current_entry is not None or current_path_info is not None): advance_entry = -1 advance_path = -1 result = None changed = None path_handled = 0 if current_entry is None: # unversioned - the check for path_handled when the path # is advanced will yield this path if needed. pass elif current_path_info is None: # no path is fine: the per entry code will handle it. result, changed = self._process_entry(current_entry, current_path_info) else: minikind = _minikind_from_string( current_entry[1][self.target_index][0]) cmp_result = ((current_path_info[1] > current_entry[0][1]) - (current_path_info[1] < current_entry[0][1])) if (cmp_result or minikind == b'a' or minikind == b'r'): # The current path on disk doesn't match the dirblock # record. Either the dirblock record is marked as # absent/renamed, or the file on disk is not present at all # in the dirblock. 
Either way, report about the dirblock # entry, and let other code handle the filesystem one. # Compare the basename for these files to determine # which comes first if cmp_result < 0: # extra file on disk: pass for now, but only # increment the path, not the entry advance_entry = 0 else: # entry referring to file not present on disk. # advance the entry only, after processing. result, changed = self._process_entry(current_entry, None) advance_path = 0 else: # paths are the same, and the dirstate entry is not # absent or renamed. result, changed = self._process_entry(current_entry, current_path_info) if changed is not None: path_handled = -1 if not changed and not self.include_unchanged: changed = None # >- loop control starts here: # >- entry if advance_entry and current_entry is not None: self.current_block_pos = self.current_block_pos + 1 if self.current_block_pos < PyList_GET_SIZE(self.current_block_list): current_entry = self.current_block_list[self.current_block_pos] else: current_entry = None # >- path if advance_path and current_path_info is not None: if not path_handled: # unversioned in all regards if self.want_unversioned: new_executable = bool( stat.S_ISREG(current_path_info[3].st_mode) and stat.S_IEXEC & current_path_info[3].st_mode) relpath_unicode = utf8_decode(current_path_info[0]) if changed is not None: raise AssertionError( "result is not None: %r" % result) result = DirstateInventoryChange( None, (None, relpath_unicode), True, (False, False), (None, None), (None, utf8_decode(current_path_info[1])), (None, current_path_info[2]), (None, new_executable)) changed = True # don't descend into this unversioned path if it is # a dir if current_path_info[2] == 'directory': del self.current_dir_list[self.path_index] self.path_index = self.path_index - 1 # don't descend the disk iterator into any tree # paths.
if current_path_info[2] == 'tree-reference': del self.current_dir_list[self.path_index] self.path_index = self.path_index - 1 self.path_index = self.path_index + 1 if self.path_index < len(self.current_dir_list): current_path_info = self.current_dir_list[self.path_index] if current_path_info[2] == 'directory': current_path_info = self._maybe_tree_ref( current_path_info) else: current_path_info = None if changed is not None: # Found a result on this pass, yield it if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: return result if self.current_block is not None: self.block_index = self.block_index + 1 self._update_current_block() if self.current_dir_info is not None: self.path_index = 0 self.current_dir_list = None try: self.current_dir_info = next(self.dir_iterator) self.current_dir_list = self.current_dir_info[1] except StopIteration, _: self.current_dir_info = None cdef object _next_consistent_entries(self): """Grabs the next specific file parent case to consider. :return: A list of the results, each of which is as for _process_entry. """ results = [] while self.search_specific_file_parents: # Process the parent directories for the paths we were iterating. # Even in extremely large trees this should be modest, so currently # no attempt is made to optimise. path_utf8 = self.search_specific_file_parents.pop() if path_utf8 in self.searched_exact_paths: # We've examined this path. continue if is_inside_any(self.searched_specific_files, path_utf8): # We've examined this path. continue path_entries = self.state._entries_for_path(path_utf8) # We need either one or two entries. If the path in # self.target_index has moved (so the entry in source_index is in # 'ar') then we need to also look for the entry for this path in # self.source_index, to output the appropriate delete-or-rename. 
selected_entries = [] found_item = False for candidate_entry in path_entries: # Find entries present in target at this path: if candidate_entry[1][self.target_index][0] not in (b'a', b'r'): found_item = True selected_entries.append(candidate_entry) # Find entries present in source at this path: elif (self.source_index is not None and candidate_entry[1][self.source_index][0] not in (b'a', b'r')): found_item = True if candidate_entry[1][self.target_index][0] == b'a': # Deleted, emit it here. selected_entries.append(candidate_entry) else: # renamed, emit it when we process the directory it # ended up at. self.search_specific_file_parents.add( candidate_entry[1][self.target_index][1]) if not found_item: raise AssertionError( "Missing entry for specific path parent %r, %r" % ( path_utf8, path_entries)) path_info = self._path_info(path_utf8, path_utf8.decode('utf8')) for entry in selected_entries: if entry[0][2] in self.seen_ids: continue result, changed = self._process_entry(entry, path_info) if changed is None: raise AssertionError( "Got entry<->path mismatch for specific path " "%r entry %r path_info %r " % ( path_utf8, entry, path_info)) # Only include changes - we're outside the users requested # expansion. if changed: self._gather_result_for_consistency(result) if (result.kind[0] == 'directory' and result.kind[1] != 'directory'): # This stopped being a directory, the old children have # to be included. if entry[1][self.source_index][0] == b'r': # renamed, take the source path entry_path_utf8 = entry[1][self.source_index][1] else: entry_path_utf8 = path_utf8 initial_key = (entry_path_utf8, b'', b'') block_index, _ = self.state._find_block_index_from_key( initial_key) if block_index == 0: # The children of the root are in block index 1. 
block_index = block_index + 1 current_block = None if block_index < len(self.state._dirblocks): current_block = self.state._dirblocks[block_index] if not is_inside( entry_path_utf8, current_block[0]): # No entries for this directory at all. current_block = None if current_block is not None: for entry in current_block[1]: if entry[1][self.source_index][0] in (b'a', b'r'): # Not in the source tree, so doesn't have to be # included. continue # Path of the entry itself. self.search_specific_file_parents.add( self.pathjoin(*entry[0][:2])) if changed or self.include_unchanged: results.append((result, changed)) self.searched_exact_paths.add(path_utf8) return results cdef object _path_info(self, utf8_path, unicode_path): """Generate path_info for unicode_path. :return: None if unicode_path does not exist, or a path_info tuple. """ abspath = self.tree.abspath(unicode_path) try: stat = os.lstat(abspath) except OSError, e: if e.errno == errno.ENOENT: # the path does not exist. return None else: raise utf8_basename = utf8_path.rsplit(b'/', 1)[-1] dir_info = (utf8_path, utf8_basename, file_kind_from_stat_mode(stat.st_mode), stat, abspath) if dir_info[2] == 'directory': if self.tree._directory_is_tree_reference( unicode_path): # Rewrite the kind in the tuple we return, not in the # unrelated self.root_dir_info. dir_info = dir_info[:2] + \ ('tree-reference',) + dir_info[3:] return dir_info bzrformats_3.4.0.orig/bzrformats/_groupcompress_pyx.pyx0000644000000000000000000003134415162073400020563 0ustar00# Copyright (C) 2008, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details.
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # cython: language_level=3 """Compiled extensions for doing compression.""" cdef extern from "python-compat.h": pass from cpython.bytes cimport (PyBytes_AS_STRING, PyBytes_CheckExact, PyBytes_FromStringAndSize, PyBytes_GET_SIZE) from cpython.mem cimport PyMem_Free, PyMem_Malloc from cpython.object cimport PyObject from libc.stdlib cimport free from libc.string cimport memcpy cdef extern from "delta.h": struct source_info: void *buf unsigned long size unsigned long agg_offset struct delta_index: pass ctypedef enum delta_result: DELTA_OK DELTA_OUT_OF_MEMORY DELTA_INDEX_NEEDED DELTA_SOURCE_EMPTY DELTA_SOURCE_BAD DELTA_BUFFER_EMPTY DELTA_SIZE_TOO_BIG delta_result create_delta_index(source_info *src, delta_index *old, delta_index **fresh, int max_entries) nogil delta_result create_delta_index_from_delta(source_info *delta, delta_index *old, delta_index **fresh) nogil void free_delta_index(delta_index *index) nogil delta_result create_delta(delta_index *indexes, void *buf, unsigned long bufsize, unsigned long *delta_size, unsigned long max_delta_size, void **delta_data) nogil unsigned long get_delta_hdr_size(unsigned char **datap, unsigned char *top) nogil unsigned long sizeof_delta_index(delta_index *index) Py_ssize_t DELTA_SIZE_MIN int get_hash_offset(delta_index *index, int pos, unsigned int *hash_offset) int get_entry_summary(delta_index *index, int pos, unsigned int *global_offset, unsigned int *hash_val) unsigned int rabin_hash (unsigned char *data) def make_delta_index(source): return DeltaIndex(source) cdef object _translate_delta_failure(delta_result result): if result == DELTA_OUT_OF_MEMORY: return MemoryError("Delta function failed to allocate memory") elif result == DELTA_INDEX_NEEDED: return ValueError("Delta function requires delta_index param") 
elif result == DELTA_SOURCE_EMPTY: return ValueError("Delta function given empty source_info param") elif result == DELTA_SOURCE_BAD: return RuntimeError("Delta function given invalid source_info param") elif result == DELTA_BUFFER_EMPTY: return ValueError("Delta function given empty buffer params") return AssertionError("Unrecognised delta result code: %d" % result) def _rabin_hash(content): if not PyBytes_CheckExact(content): raise ValueError('content must be a string') if len(content) < 16: raise ValueError('content must be at least 16 bytes long') # Try to cast it to an int, if it can fit return int(rabin_hash((PyBytes_AS_STRING(content)))) cdef class DeltaIndex: cdef readonly list _sources cdef source_info *_source_infos cdef delta_index *_index cdef public unsigned long _source_offset cdef readonly unsigned int _max_num_sources cdef public int _max_bytes_to_index def __init__(self, source=None, max_bytes_to_index=None): self._sources = [] self._index = NULL self._max_num_sources = 65000 self._source_infos = PyMem_Malloc( sizeof(source_info) * self._max_num_sources) if self._source_infos == NULL: raise MemoryError('failed to allocate memory for DeltaIndex') self._source_offset = 0 self._max_bytes_to_index = 0 if max_bytes_to_index is not None: self._max_bytes_to_index = max_bytes_to_index if source is not None: self.add_source(source, 0) def __sizeof__(self): # We want to track the _source_infos allocations, but the referenced # void* are actually tracked in _sources itself. return (sizeof(DeltaIndex) + (sizeof(source_info) * self._max_num_sources) + sizeof_delta_index(self._index)) def __repr__(self): return '%s(%d, %d)' % (self.__class__.__name__, len(self._sources), self._source_offset) def __dealloc__(self): if self._index != NULL: free_delta_index(self._index) self._index = NULL PyMem_Free(self._source_infos) def _has_index(self): return (self._index != NULL) def _dump_index(self): """Dump the pointers in the index. 
This is an arbitrary layout, used for testing. It is not meant to be used in production code. :return: (hash_list, entry_list) hash_list A list of offsets, so hash[i] points to the 'hash bucket' starting at the given offset and going until hash[i+1] entry_list A list of (text_offset, hash_val). text_offset is the offset in the "source" texts, and hash_val is the RABIN hash for that offset. Note that the entry should be in the hash bucket defined by hash[(hash_val & mask)] && hash[(hash_val & mask) + 1] """ cdef int pos cdef unsigned int text_offset cdef unsigned int hash_val cdef unsigned int hash_offset if self._index == NULL: return None hash_list = [] pos = 0 while get_hash_offset(self._index, pos, &hash_offset): hash_list.append(int(hash_offset)) pos += 1 entry_list = [] pos = 0 while get_entry_summary(self._index, pos, &text_offset, &hash_val): # Map back using 'int' so that we don't get Long everywhere, when # almost everything is <2**31. val = tuple(map(int, [text_offset, hash_val])) entry_list.append(val) pos += 1 return hash_list, entry_list def add_delta_source(self, delta, unadded_bytes): """Add a new delta to the source texts. :param delta: The text of the delta, this must be a byte string. :param unadded_bytes: Number of bytes that were added to the source that were not indexed. 
""" cdef char *c_delta cdef Py_ssize_t c_delta_size cdef delta_index *index cdef delta_result res cdef unsigned int source_location cdef source_info *src cdef unsigned int num_indexes if not PyBytes_CheckExact(delta): raise TypeError('delta is not a bytestring') source_location = len(self._sources) if source_location >= self._max_num_sources: self._expand_sources() self._sources.append(delta) c_delta = PyBytes_AS_STRING(delta) c_delta_size = PyBytes_GET_SIZE(delta) src = self._source_infos + source_location src.buf = c_delta src.size = c_delta_size src.agg_offset = self._source_offset + unadded_bytes with nogil: res = create_delta_index_from_delta(src, self._index, &index) if res != DELTA_OK: raise _translate_delta_failure(res) self._source_offset = src.agg_offset + src.size if index != self._index: free_delta_index(self._index) self._index = index def add_source(self, source, unadded_bytes): """Add a new bit of source text to the delta indexes. :param source: The text in question, this must be a byte string :param unadded_bytes: Assume there are this many bytes that didn't get added between this source and the end of the previous source. :param max_pointers: Add no more than this many entries to the index. By default, we sample every 16 bytes, if that would require more than max_entries, we will reduce the sampling rate. A value of 0 means unlimited, None means use the default limit. 
""" cdef char *c_source cdef Py_ssize_t c_source_size cdef delta_index *index cdef delta_result res cdef unsigned int source_location cdef source_info *src cdef unsigned int num_indexes cdef int max_num_entries if not PyBytes_CheckExact(source): raise TypeError('source is not a bytestring') source_location = len(self._sources) if source_location >= self._max_num_sources: self._expand_sources() if source_location != 0 and self._index == NULL: # We were lazy about populating the index, create it now self._populate_first_index() self._sources.append(source) c_source = PyBytes_AS_STRING(source) c_source_size = PyBytes_GET_SIZE(source) src = self._source_infos + source_location src.buf = c_source src.size = c_source_size src.agg_offset = self._source_offset + unadded_bytes self._source_offset = src.agg_offset + src.size # We delay creating the index on the first insert if source_location != 0: with nogil: res = create_delta_index(src, self._index, &index, self._max_bytes_to_index) if res != DELTA_OK: raise _translate_delta_failure(res) if index != self._index: free_delta_index(self._index) self._index = index cdef _populate_first_index(self): cdef delta_index *index cdef delta_result res if len(self._sources) != 1 or self._index != NULL: raise AssertionError('_populate_first_index should only be' ' called when we have a single source and no index yet') # We know that self._index is already NULL, so create_delta_index # will always create a new index unless there's a malloc failure with nogil: res = create_delta_index(&self._source_infos[0], NULL, &index, self._max_bytes_to_index) if res != DELTA_OK: raise _translate_delta_failure(res) self._index = index cdef _expand_sources(self): raise RuntimeError('if we move self._source_infos, then we need to' ' change all of the index pointers as well.') def make_delta(self, target_bytes, max_delta_size=0): """Create a delta from the current source to the target bytes.""" cdef char *target cdef Py_ssize_t target_size cdef void * 
delta cdef unsigned long delta_size cdef unsigned long c_max_delta_size cdef delta_result res if self._index == NULL: if len(self._sources) == 0: return None # We were just lazy about generating the index self._populate_first_index() if not PyBytes_CheckExact(target_bytes): raise TypeError('target is not a bytestring') target = PyBytes_AS_STRING(target_bytes) target_size = PyBytes_GET_SIZE(target_bytes) # TODO: inline some of create_delta so we at least don't have to double # malloc, and can instead use PyBytes_FromStringAndSize, to # allocate the bytes into the final string c_max_delta_size = max_delta_size with nogil: res = create_delta(self._index, target, target_size, &delta_size, c_max_delta_size, &delta) result = None if res == DELTA_OK: result = PyBytes_FromStringAndSize(delta, delta_size) free(delta) elif res != DELTA_SIZE_TOO_BIG: raise _translate_delta_failure(res) return result def make_delta(source_bytes, target_bytes): """Create a delta, this is a wrapper around DeltaIndex.make_delta.""" di = DeltaIndex(source_bytes) return di.make_delta(target_bytes) bzrformats_3.4.0.orig/bzrformats/_knit_load_data_py.py0000644000000000000000000000701415162073400020225 0ustar00# Copyright (C) 2007 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA from bzrformats.knit import KnitCorrupt def _load_data_py(kndx, fp): """Read in a knit index.""" cache = kndx._cache history = kndx._history kndx.check_header(fp) # readlines reads the whole file at once: # bad for transports like http, good for local disk # we save 60 ms doing this one change # (from calling readline each time to calling # readlines once). # probably what we want for nice behaviour on # http is an incremental readlines that yields, or # a check for local vs non local indexes, history_top = len(history) - 1 for line in fp.readlines(): rec = line.split() if len(rec) < 5 or rec[-1] != b":": # corrupt line. # FIXME: in the future we should determine if it's a # short write - and ignore it # or a different failure, and raise. RBC 20060407 continue try: parents = [] for value in rec[4:-1]: if value[:1] == b".": # uncompressed reference parent_id = value[1:] else: parent_id = history[int(value)] parents.append(parent_id) except (IndexError, ValueError) as e: # The parent could not be decoded to get its parent row. This # at a minimum will cause this row to have wrong parents, or # even to apply a delta to the wrong base and decode # incorrectly. It is therefore not usable, and because we have # encountered a situation where a new knit index had this # corrupt we can't assume that no other rows referring to the # index of this record actually mean the subsequent uncorrupt # one, so we error.
raise KnitCorrupt(kndx._filename, f"line {rec!r}: {e}") from e version_id, options, pos, size = rec[:4] try: pos = int(pos) except ValueError as e: raise KnitCorrupt( kndx._filename, f"invalid position on line {rec!r}: {e}" ) from e try: size = int(size) except ValueError as e: raise KnitCorrupt( kndx._filename, f"invalid size on line {rec!r}: {e}" ) from e # See kndx._cache_version # only want the _history index to reference the 1st # index entry for version_id if version_id not in cache: history_top += 1 index = history_top history.append(version_id) else: index = cache[version_id][5] cache[version_id] = ( version_id, options.split(b","), pos, size, tuple(parents), index, ) # end kndx._cache_version bzrformats_3.4.0.orig/bzrformats/_knit_load_data_pyx.pyx0000644000000000000000000002355215162073400020612 0ustar00# Copyright (C) 2007-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # cython: language_level=3 """Pyrex extensions to knit parsing.""" import sys from .knit import KnitCorrupt from cpython.bytes cimport (PyBytes_AsString, PyBytes_CheckExact, PyBytes_FromStringAndSize, PyBytes_Size) from cpython.dict cimport PyDict_CheckExact, PyDict_SetItem from cpython.list cimport PyList_Append, PyList_CheckExact, PyList_GET_ITEM from libc.stdlib cimport strtol from libc.string cimport memchr cdef extern from "Python.h": void *PyDict_GetItem_void "PyDict_GetItem" (object p, object key) void *PyTuple_GetItem_void_void "PyTuple_GET_ITEM" (void* tpl, int index) cdef int string_to_int_safe(char *s, char *end, int *out) except -1: """Convert a base10 string to an integer. This makes sure the whole string is consumed, or it raises ValueError. This is similar to how int(s) works, except you don't need a Python String object. :param s: The string to convert :param end: The character after the integer. So if the string is '12\0', this should be pointing at the '\0'. If the string was '12 ' then this should point at the ' '. :param out: This is the integer that will be returned :return: -1 if an exception is raised. 0 otherwise """ cdef char *integer_end # We can't just return the integer because of how pyrex determines when # there is an exception. 
out[0] = strtol(s, &integer_end, 10) if integer_end != end: py_s = PyBytes_FromStringAndSize(s, end-s) raise ValueError('%r is not a valid integer' % (py_s,)) return 0 cdef class KnitIndexReader: cdef object kndx cdef object fp cdef object cache cdef object history cdef char * cur_str cdef char * end_str cdef int history_len def __init__(self, kndx, fp): self.kndx = kndx self.fp = fp self.cache = kndx._cache self.history = kndx._history self.cur_str = NULL self.end_str = NULL self.history_len = 0 cdef int validate(self) except -1: if not PyDict_CheckExact(self.cache): raise TypeError('kndx._cache must be a python dict') if not PyList_CheckExact(self.history): raise TypeError('kndx._history must be a python list') return 0 cdef object process_options(self, char *option_str, char *end): """Process the options string into a list.""" cdef char *n # This is alternative code which creates a python string and splits it. # It is "correct" and more obvious, but slower than the following code. # It can be uncommented to switch in case the other code is seen as # suspect. # options = PyBytes_FromStringAndSize(option_str, end - option_str) # return options.split(',') final_options = [] while option_str < end: n = memchr(option_str, c',', end - option_str) if n == NULL: n = end n_option = PyBytes_FromStringAndSize(option_str, n - option_str) PyList_Append(final_options, n_option) # Move past the ',' option_str = n+1 return final_options cdef object process_parents(self, char *parent_str, char *end): cdef char *n cdef int int_parent cdef char *parent_end # Alternative, correct but slower code. 
# # parents = PyBytes_FromStringAndSize(parent_str, end - parent_str) # real_parents = [] # for parent in parents.split(): # if parent.startswith(b'.'): # real_parents.append(parent[1:]) # else: # real_parents.append(self.history[int(parent)]) # return real_parents parents = [] while parent_str <= end: n = memchr(parent_str, c' ', end - parent_str) if n == NULL or n >= end or n == parent_str: break if parent_str[0] == c'.': # This is an explicit revision id parent_str = parent_str + 1 parent = PyBytes_FromStringAndSize(parent_str, n - parent_str) else: # This is an integer mapping to original string_to_int_safe(parent_str, n, &int_parent) if int_parent >= self.history_len: raise IndexError('Parent index refers to a revision which' ' does not exist yet.' ' %d >= %d' % (int_parent, self.history_len)) # PyList_GET_ITEM returns a borrowed reference but the object cast INCREFs parent = PyList_GET_ITEM(self.history, int_parent) PyList_Append(parents, parent) parent_str = n + 1 return tuple(parents) cdef int process_one_record(self, char *start, char *end) except -1: """Take a simple string and split it into an index record.""" cdef char *version_id_str cdef int version_id_size cdef char *option_str cdef char *option_end cdef char *pos_str cdef int pos cdef char *size_str cdef int size cdef char *parent_str cdef int parent_size cdef void *cache_entry version_id_str = start option_str = memchr(version_id_str, c' ', end - version_id_str) if option_str == NULL or option_str >= end: # Short entry return 0 version_id_size = (option_str - version_id_str) # Move past the space character option_str = option_str + 1 pos_str = memchr(option_str, c' ', end - option_str) if pos_str == NULL or pos_str >= end: # Short entry return 0 option_end = pos_str pos_str = pos_str + 1 size_str = memchr(pos_str, c' ', end - pos_str) if size_str == NULL or size_str >= end: # Short entry return 0 size_str = size_str + 1 parent_str = memchr(size_str, c' ', end - size_str) if parent_str == NULL or parent_str >=
end: # Missing parents return 0 parent_str = parent_str + 1 version_id = PyBytes_FromStringAndSize(version_id_str, version_id_size) options = self.process_options(option_str, option_end) try: string_to_int_safe(pos_str, size_str - 1, &pos) string_to_int_safe(size_str, parent_str - 1, &size) parents = self.process_parents(parent_str, end) except (ValueError, IndexError) as e: py_line = PyBytes_FromStringAndSize(start, end - start) raise KnitCorrupt(self.kndx._filename, "line %r: %s" % (py_line, e)) cache_entry = PyDict_GetItem_void(self.cache, version_id) if cache_entry == NULL: PyList_Append(self.history, version_id) index = self.history_len self.history_len = self.history_len + 1 else: # PyTuple_GetItem_void_void does *not* increment the reference # counter, but casting to object does. index = PyTuple_GetItem_void_void(cache_entry, 5) PyDict_SetItem(self.cache, version_id, (version_id, options, pos, size, parents, index, )) return 1 cdef int process_next_record(self) except -1: """Process the next record in the file.""" cdef char *last cdef char *start start = self.cur_str # Find the next newline last = memchr(start, c'\n', self.end_str - start) if last == NULL: # Process until the end of the file last = self.end_str - 1 self.cur_str = self.end_str else: # The last character is right before the '\n' # And the next string is right after it self.cur_str = last + 1 last = last - 1 if last <= start or last[0] != c':': # Incomplete record return 0 return self.process_one_record(start, last) def read(self): cdef int text_size self.validate() self.kndx.check_header(self.fp) # We read the whole thing at once # TODO: jam 2007-05-09 Consider reading incrementally rather than # having to have the whole thing read up front. # we already know that calling f.readlines() versus lots of # f.readline() calls is faster. # The other possibility is to avoid a Python String here # completely. However self.fp may be a 'file-like' object # it is not guaranteed to be a real file.
text = self.fp.read() text_size = PyBytes_Size(text) self.cur_str = PyBytes_AsString(text) # This points one past the last character in the string self.end_str = self.cur_str + text_size while self.cur_str < self.end_str: self.process_next_record() cpdef _load_data_c(kndx, fp): """Load the knit index file into memory.""" reader = KnitIndexReader(kndx, fp) reader.read() bzrformats_3.4.0.orig/bzrformats/_str_helpers.pxd0000644000000000000000000000346615162073400017264 0ustar00# Copyright (C) 2007-2010 Canonical Ltd # Copyright (C) 2018 Breezy developers # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details.
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # cython: language_level=3 """Trivial string helpers for use in other cython modules.""" cdef extern from "python-compat.h": object PyBytes_FromStringAndSize (char *, Py_ssize_t) cdef inline void* _my_memrchr(void *s, int c, size_t n): # cannot_raise # memrchr seems to be a GNU extension, so we have to implement it ourselves cdef char *pos cdef char *start start = s pos = start + n - 1 while pos >= start: if pos[0] == c: return pos pos = pos - 1 return NULL cdef inline object safe_string_from_size(char *s, Py_ssize_t size): if size < 0: raise AssertionError( 'tried to create a string with an invalid size: %d' % size) return PyBytes_FromStringAndSize(s, size) cdef inline object safe_interned_string_from_size(char *s, Py_ssize_t size): if size < 0: raise AssertionError( 'tried to create a string with an invalid size: %d' % size) # For now, don't intern on Python 3 return PyBytes_FromStringAndSize(s, size) bzrformats_3.4.0.orig/bzrformats/annotate.py0000644000000000000000000003245115162115103016232 0ustar00# Copyright (C) 2005-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """File annotate based on VersionedFiles.""" from typing import TYPE_CHECKING from vcsgraph import ( known_graph as _mod_known_graph, ) from . import errors, osutils if TYPE_CHECKING: from bzrformats.versionedfile import VersionedFiles # Module-level variable that can be overridden for testing _break_annotation_tie = None class VersionedFileAnnotator: """Class that drives performing annotations.""" _vf: "VersionedFiles" def __init__(self, vf): """Create a new Annotator from a VersionedFile.""" self._vf = vf self._parent_map = {} self._text_cache = {} # Map from key => number of nexts that will be built from this key self._num_needed_children = {} self._annotations_cache = {} self._heads_provider = None self._ann_tuple_cache = {} def _update_needed_children(self, key, parent_keys): for parent_key in parent_keys: if parent_key in self._num_needed_children: self._num_needed_children[parent_key] += 1 else: self._num_needed_children[parent_key] = 1 def _get_needed_keys(self, key): """Determine the texts we need to get from the backing vf. :return: (vf_keys_needed, ann_keys_needed) vf_keys_needed These are keys that we need to get from the vf ann_keys_needed Texts which we have in self._text_cache but we don't have annotations for. We need to yield these in the proper order so that we can get proper annotations. 
""" parent_map = self._parent_map # We need 1 extra copy of the node we will be looking at when we are # done self._num_needed_children[key] = 1 vf_keys_needed = set() ann_keys_needed = set() needed_keys = {key} while needed_keys: parent_lookup = [] next_parent_map = {} for key in needed_keys: if key in self._parent_map: # We don't need to lookup this key in the vf if key not in self._text_cache: # Extract this text from the vf vf_keys_needed.add(key) elif key not in self._annotations_cache: # We do need to annotate ann_keys_needed.add(key) next_parent_map[key] = self._parent_map[key] else: parent_lookup.append(key) vf_keys_needed.add(key) needed_keys = set() next_parent_map.update(self._vf.get_parent_map(parent_lookup)) for key, parent_keys in next_parent_map.items(): if parent_keys is None: # No graph versionedfile parent_keys = () next_parent_map[key] = () self._update_needed_children(key, parent_keys) needed_keys.update( [key for key in parent_keys if key not in parent_map] ) parent_map.update(next_parent_map) # _heads_provider does some graph caching, so it is only valid # while self._parent_map hasn't changed self._heads_provider = None return vf_keys_needed, ann_keys_needed def _get_needed_texts(self, key, pb=None): """Get the texts we need to properly annotate key. :param key: A Key that is present in self._vf :return: Yield (this_key, text, num_lines) 'text' is an opaque object that just has to work with whatever matcher object we are using. Currently it is always 'lines' but future improvements may change this to a simple text string. 
""" keys, ann_keys = self._get_needed_keys(key) if pb is not None: pb.update("getting stream", 0, len(keys)) stream = self._vf.get_record_stream(keys, "topological", True) for _idx, record in enumerate(stream): if pb is not None: pb.update("extracting", 0, len(keys)) if record.storage_kind == "absent": raise errors.RevisionNotPresent(record.key, self._vf) this_key = record.key lines = record.get_bytes_as("lines") num_lines = len(lines) self._text_cache[this_key] = lines yield this_key, lines, num_lines for key in ann_keys: lines = self._text_cache[key] num_lines = len(lines) yield key, lines, num_lines def _get_parent_annotations_and_matches(self, key, text, parent_key): """Get the list of annotations for the parent, and the matching lines. :param text: The opaque value given by _get_needed_texts :param parent_key: The key for the parent text :return: (parent_annotations, matching_blocks) parent_annotations is a list as long as the number of lines in parent matching_blocks is a list of (parent_idx, text_idx, len) tuples indicating which lines match between the two texts """ parent_lines = self._text_cache[parent_key] parent_annotations = self._annotations_cache[parent_key] # PatienceSequenceMatcher should probably be part of Policy from patiencediff import PatienceSequenceMatcher matcher = PatienceSequenceMatcher(None, parent_lines, text) matching_blocks = matcher.get_matching_blocks() return parent_annotations, matching_blocks def _update_from_first_parent(self, key, annotations, lines, parent_key): """Reannotate this text relative to its first parent.""" ( parent_annotations, matching_blocks, ) = self._get_parent_annotations_and_matches(key, lines, parent_key) for parent_idx, lines_idx, match_len in matching_blocks: # For all matching regions we copy across the parent annotations annotations[lines_idx : lines_idx + match_len] = parent_annotations[ parent_idx : parent_idx + match_len ] def _update_from_other_parents( self, key, annotations, lines, this_annotation, 
parent_key ): """Reannotate this text relative to a second (or more) parent.""" ( parent_annotations, matching_blocks, ) = self._get_parent_annotations_and_matches(key, lines, parent_key) last_ann = None last_parent = None last_res = None # TODO: consider making all annotations unique and then using 'is' # everywhere. Current results claim that isn't any faster, # because of the time spent deduping # deduping also saves a bit of memory. For NEWS it saves ~1MB, # but that is out of 200-300MB for extracting everything, so a # fairly trivial amount for parent_idx, lines_idx, match_len in matching_blocks: # For lines which match this parent, we will now resolve whether # this parent wins over the current annotation ann_sub = annotations[lines_idx : lines_idx + match_len] par_sub = parent_annotations[parent_idx : parent_idx + match_len] if ann_sub == par_sub: continue for idx in range(match_len): ann = ann_sub[idx] par_ann = par_sub[idx] ann_idx = lines_idx + idx if ann == par_ann: # Nothing to change continue if ann == this_annotation: # Originally claimed 'this', but it was really in this # parent annotations[ann_idx] = par_ann continue # Resolve the fact that both sides have a different value for # last modified if ann == last_ann and par_ann == last_parent: annotations[ann_idx] = last_res else: new_ann = set(ann) new_ann.update(par_ann) new_ann = tuple(sorted(new_ann)) annotations[ann_idx] = new_ann last_ann = ann last_parent = par_ann last_res = new_ann def _record_annotation(self, key, parent_keys, annotations): self._annotations_cache[key] = annotations for parent_key in parent_keys: num = self._num_needed_children[parent_key] num -= 1 if num == 0: del self._text_cache[parent_key] del self._annotations_cache[parent_key] # Do we want to clean up _num_needed_children at this point as # well? 
self._num_needed_children[parent_key] = num def _annotate_one(self, key, text, num_lines): this_annotation = (key,) # Note: annotations will be mutated by calls to _update_from* annotations = [this_annotation] * num_lines parent_keys = self._parent_map[key] if parent_keys: self._update_from_first_parent(key, annotations, text, parent_keys[0]) for parent in parent_keys[1:]: self._update_from_other_parents( key, annotations, text, this_annotation, parent ) self._record_annotation(key, parent_keys, annotations) def add_special_text(self, key, parent_keys, text): """Add a specific text to the graph. This is used to add a text which is not otherwise present in the versioned file. (eg. a WorkingTree injecting 'current:' into the graph to annotate the edited content.) :param key: The key to use to request this text be annotated :param parent_keys: The parents of this text :param text: A string containing the content of the text """ self._parent_map[key] = parent_keys self._text_cache[key] = osutils.split_lines(text) self._heads_provider = None def annotate(self, key, pb=None): """Return annotated fulltext for the given key. :param key: A tuple defining the text to annotate :param pb: An optional progress bar to report progress on :return: ([annotations], [lines]) annotations is a list of tuples of keys, one for each line in lines each key is a possible source for the given line. 
lines the text of "key" as a list of lines """ for text_key, text, num_lines in self._get_needed_texts(key, pb=pb): self._annotate_one(text_key, text, num_lines) try: annotations = self._annotations_cache[key] except KeyError as exc: raise errors.RevisionNotPresent(key, self._vf) from exc return annotations, self._text_cache[key] def _get_heads_provider(self): if self._heads_provider is None: self._heads_provider = _mod_known_graph.KnownGraph(self._parent_map) return self._heads_provider def _resolve_annotation_tie(self, the_heads, line, tiebreaker): if tiebreaker is None: head = sorted(the_heads)[0] else: # Backwards compatibility, break up the heads into pairs and # resolve the result next_head = iter(the_heads) head = next(next_head) for possible_head in next_head: annotated_lines = ((head, line), (possible_head, line)) head = tiebreaker(annotated_lines)[0] return head def annotate_flat(self, key): """Determine the single-best-revision to source for each line. This is meant as a compatibility thunk to how annotate() used to work. :return: [(ann_key, line)] A list of tuples with a single annotation key for each line. 
""" custom_tiebreaker = _break_annotation_tie annotations, lines = self.annotate(key) out = [] heads = self._get_heads_provider().heads append = out.append for annotation, line in zip(annotations, lines, strict=False): if len(annotation) == 1: head = annotation[0] else: the_heads = heads(annotation) if len(the_heads) == 1: # get the item out of the set head = next(iter(the_heads)) else: head = self._resolve_annotation_tie( the_heads, line, custom_tiebreaker ) append((head, line)) return out bzrformats_3.4.0.orig/bzrformats/bisect_multi.py0000644000000000000000000000517715162073433017122 0ustar00# Copyright (C) 2007 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Bisection lookup multiple keys.""" __all__ = [ "bisect_multi_bytes", ] def bisect_multi_bytes(content_lookup, size, keys): """Perform bisection lookups for keys using byte based addressing. The keys are looked up via the content_lookup routine. The content_lookup routine gives bisect_multi_bytes information about where to keep looking up to find the data for the key, and bisect_multi_bytes feeds this back into the lookup function until the search is complete. The search is complete when the list of keys which have returned something other than -1 or +1 is empty. Keys which are not found are not returned to the caller. 
:param content_lookup: A callable that takes a list of (offset, key) pairs and returns a list of result tuples ((offset, key), result). Each result can be one of: -1: The key comes earlier in the content. False: The key is not present in the content. +1: The key comes later in the content. Any other value: A final result to return to the caller. :param size: The length of the content. :param keys: The keys to bisect for. :return: An iterator of the results. """ # possibly make this a generator, but a list meets the contract for now. result = [] delta = size // 2 search_keys = [(delta, key) for key in keys] while search_keys: search_results = content_lookup(search_keys) if delta > 1: delta = delta // 2 search_keys = [] for (location, key), status in search_results: if status == -1: search_keys.append((location - delta, key)) elif status == 1: search_keys.append((location + delta, key)) elif status is False: # not present, stop searching continue else: result.append((key, status)) return result bzrformats_3.4.0.orig/bzrformats/btree_index.py0000644000000000000000000020333215162115103016707 0ustar00# Copyright (C) 2008-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """B+Tree indices.""" import logging import math import tempfile import zlib from io import BytesIO from . 
import chunk_writer, lru_cache, osutils from . import index as _mod_index from .index import _OPTION_KEY_ELEMENTS, _OPTION_LEN, _OPTION_NODE_REFS from .lru_cache import FIFOCache _BTSIGNATURE = b"B+Tree Graph Index 2\n" _OPTION_ROW_LENGTHS = b"row_lengths=" _LEAF_FLAG = b"type=leaf\n" _INTERNAL_FLAG = b"type=internal\n" _INTERNAL_OFFSET = b"offset=" _RESERVED_HEADER_BYTES = 120 _PAGE_SIZE = 4096 # 4K per page: 4MB - 1000 entries _NODE_CACHE_SIZE = 1000 logger = logging.getLogger(name="bzrformats.btree_index") evil_logger = logging.getLogger(name="bzrformats.evil") class _BuilderRow: """The stored state accumulated while writing out a row in the index. :ivar spool: A temporary file used to accumulate nodes for this row in the tree. :ivar nodes: The count of nodes emitted so far. """ def __init__(self): """Create a _BuilderRow.""" self.nodes = 0 self.spool = None # tempfile.TemporaryFile(prefix='bzr-index-row-') self.writer = None def finish_node(self, pad=True): byte_lines, _, padding = self.writer.finish() if self.nodes == 0: self.spool = BytesIO() # padded note: self.spool.write(b"\x00" * _RESERVED_HEADER_BYTES) elif self.nodes == 1: # We got bigger than 1 node, switch to a temp file spool = tempfile.TemporaryFile(prefix="bzr-index-row-") spool.write(self.spool.getvalue()) self.spool = spool skipped_bytes = 0 if not pad and padding: del byte_lines[-1] skipped_bytes = padding self.spool.writelines(byte_lines) remainder = (self.spool.tell() + skipped_bytes) % _PAGE_SIZE if remainder != 0: raise AssertionError( "incorrect node length: %d, %d" % (self.spool.tell(), remainder) ) self.nodes += 1 self.writer = None class _InternalBuilderRow(_BuilderRow): """The stored state accumulated while writing out internal rows.""" def finish_node(self, pad=True): if not pad: raise AssertionError("Must pad internal nodes only.") _BuilderRow.finish_node(self) class _LeafBuilderRow(_BuilderRow): """The stored state accumulated while writing out leaf rows.""" class
BTreeBuilder(_mod_index.GraphIndexBuilder): """A Builder for B+Tree based Graph indices. The resulting graph has the structure: _SIGNATURE OPTIONS NODES _SIGNATURE := 'B+Tree Graph Index 2' NEWLINE OPTIONS := REF_LISTS KEY_ELEMENTS LENGTH ROW_LENGTHS REF_LISTS := 'node_ref_lists=' DIGITS NEWLINE KEY_ELEMENTS := 'key_elements=' DIGITS NEWLINE LENGTH := 'len=' DIGITS NEWLINE ROW_LENGTHS := 'row_lengths=' DIGITS (COMMA DIGITS)* NEWLINE NODES := NODE_COMPRESSED* NODE_COMPRESSED:= COMPRESSED_BYTES{4096} NODE_RAW := INTERNAL | LEAF INTERNAL := INTERNAL_FLAG POINTERS LEAF := LEAF_FLAG ROWS KEY_ELEMENT := Not-whitespace-utf8 KEY := KEY_ELEMENT (NULL KEY_ELEMENT)* ROWS := ROW* ROW := KEY NULL ABSENT? NULL REFERENCES NULL VALUE NEWLINE ABSENT := 'a' REFERENCES := REFERENCE_LIST (TAB REFERENCE_LIST){node_ref_lists - 1} REFERENCE_LIST := (REFERENCE (CR REFERENCE)*)? REFERENCE := KEY VALUE := no-newline-no-null-bytes """ def __init__(self, reference_lists=0, key_elements=1, spill_at=100000): """See GraphIndexBuilder.__init__. :param spill_at: Optional parameter controlling the maximum number of nodes that BTreeBuilder will hold in memory. """ _mod_index.GraphIndexBuilder.__init__( self, reference_lists=reference_lists, key_elements=key_elements ) self._spill_at = spill_at self._backing_indices = [] # A map of {key: (node_refs, value)} self._nodes = {} # Indicate it hasn't been built yet self._nodes_by_key = None self._optimize_for_size = False def add_node(self, key, value, references=()): r"""Add a node to the index. If adding the node causes the builder to reach its spill_at threshold, disk spilling will be triggered. :param key: The key. keys are non-empty tuples containing as many whitespace-free utf8 bytestrings as the key length defined for this index. :param references: An iterable of iterables of keys. Each is a reference to another key. :param value: The value to associate with the key. It may be any bytes as long as it does not contain \0 or \n. """ # Ensure that 'key' is a tuple.
key = tuple(key) # we don't care about absent_references node_refs, _ = self._check_key_ref_value(key, references, value) if key in self._nodes: raise _mod_index.BadIndexDuplicateKey(key, self) self._nodes[key] = (node_refs, value) if self._nodes_by_key is not None and self._key_length > 1: self._update_nodes_by_key(key, value, node_refs) if len(self._nodes) < self._spill_at: return self._spill_mem_keys_to_disk() def _spill_mem_keys_to_disk(self): """Write the in memory keys down to disk to cap memory consumption. If we already have some keys written to disk, we will combine them so as to preserve the sorted order. The algorithm for combining uses powers of two. So on the first spill, write all mem nodes into a single index. On the second spill, combine the mem nodes with the nodes on disk to create a 2x sized disk index and get rid of the first index. On the third spill, create a single new disk index, which will contain the mem nodes, and preserve the existing 2x sized index. On the fourth, combine mem with the first and second indexes, creating a new one of size 4x. On the fifth create a single new one, etc. 
""" if self._combine_backing_indices: (new_backing_file, size, backing_pos) = self._spill_mem_keys_and_combine() else: new_backing_file, size = self._spill_mem_keys_without_combining() # The transport isn't used because we override _file directly below class _DummyTransport: def recommended_page_size(self): return 4096 new_backing = BTreeGraphIndex(_DummyTransport(), "", size) # GC will clean up the file new_backing._file = new_backing_file if self._combine_backing_indices: if len(self._backing_indices) == backing_pos: self._backing_indices.append(None) self._backing_indices[backing_pos] = new_backing for backing_pos in range(backing_pos): # noqa: B020 self._backing_indices[backing_pos] = None else: self._backing_indices.append(new_backing) self._nodes = {} self._nodes_by_key = None def _spill_mem_keys_without_combining(self): return self._write_nodes(self._iter_mem_nodes(), allow_optimize=False) def _spill_mem_keys_and_combine(self): iterators_to_combine = [self._iter_mem_nodes()] pos = -1 for pos, backing in enumerate(self._backing_indices): if backing is None: pos -= 1 break iterators_to_combine.append(backing.iter_all_entries()) backing_pos = pos + 1 new_backing_file, size = self._write_nodes( self._iter_smallest(iterators_to_combine), allow_optimize=False ) return new_backing_file, size, backing_pos def add_nodes(self, nodes): """Add nodes to the index. :param nodes: An iterable of (key, node_refs, value) entries to add. 
""" if self.reference_lists: for key, value, node_refs in nodes: self.add_node(key, value, node_refs) else: for key, value in nodes: self.add_node(key, value) def _iter_mem_nodes(self): """Iterate over the nodes held in memory.""" nodes = self._nodes if self.reference_lists: for key in sorted(nodes): references, value = nodes[key] yield self, key, value, references else: for key in sorted(nodes): references, value = nodes[key] yield self, key, value def _iter_smallest(self, iterators_to_combine): if len(iterators_to_combine) == 1: yield from iterators_to_combine[0] return current_values = [] for iterator in iterators_to_combine: try: current_values.append(next(iterator)) except StopIteration: current_values.append(None) last = None while True: # Decorate candidates with the value to allow 2.4's min to be used. candidates = [ (item[1][1], item) for item in enumerate(current_values) if item[1] is not None ] if not len(candidates): return selected = min(candidates) # undecorate back to (pos, node) selected = selected[1] if last == selected[1][1]: raise _mod_index.BadIndexDuplicateKey(last, self) last = selected[1][1] # Yield, with self as the index yield (self,) + selected[1][1:] pos = selected[0] try: current_values[pos] = next(iterators_to_combine[pos]) except StopIteration: current_values[pos] = None def _add_key(self, string_key, line, rows, allow_optimize=True): """Add a key to the current chunk. :param string_key: The key to add. :param line: The fully serialised key and value. :param allow_optimize: If set to False, prevent setting the optimize flag when writing out. This is used by the _spill_mem_keys_to_disk functionality. 
""" new_leaf = False if rows[-1].writer is None: # opening a new leaf chunk; new_leaf = True for pos, internal_row in enumerate(rows[:-1]): # flesh out any internal nodes that are needed to # preserve the height of the tree if internal_row.writer is None: length = _PAGE_SIZE if internal_row.nodes == 0: length -= _RESERVED_HEADER_BYTES # padded if allow_optimize: optimize_for_size = self._optimize_for_size else: optimize_for_size = False internal_row.writer = chunk_writer.ChunkWriter( length, 0, optimize_for_size=optimize_for_size ) internal_row.writer.write(_INTERNAL_FLAG) internal_row.writer.write( _INTERNAL_OFFSET + b"%d\n" % rows[pos + 1].nodes ) # add a new leaf length = _PAGE_SIZE if rows[-1].nodes == 0: length -= _RESERVED_HEADER_BYTES # padded rows[-1].writer = chunk_writer.ChunkWriter( length, optimize_for_size=self._optimize_for_size ) rows[-1].writer.write(_LEAF_FLAG) if rows[-1].writer.write(line): # if we failed to write, despite having an empty page to write to, # then line is too big. raising the error avoids infinite recursion # searching for a suitably large page that will not be found. if new_leaf: raise _mod_index.BadIndexKey(string_key) # this key did not fit in the node: rows[-1].finish_node() key_line = string_key + b"\n" new_row = True for row in reversed(rows[:-1]): # Mark the start of the next node in the node above. If it # doesn't fit then propagate upwards until we find one that # it does fit into. if row.writer.write(key_line): row.finish_node() else: # We've found a node that can handle the pointer. 
new_row = False break # If we reached the current root without being able to mark the # division point, then we need a new root: if new_row: # We need a new row logger.debug("Inserting new global row.") new_row = _InternalBuilderRow() reserved_bytes = 0 rows.insert(0, new_row) # This will be padded, hence the -_RESERVED_HEADER_BYTES new_row.writer = chunk_writer.ChunkWriter( _PAGE_SIZE - _RESERVED_HEADER_BYTES, reserved_bytes, optimize_for_size=self._optimize_for_size, ) new_row.writer.write(_INTERNAL_FLAG) new_row.writer.write(_INTERNAL_OFFSET + b"%d\n" % (rows[1].nodes - 1)) new_row.writer.write(key_line) self._add_key(string_key, line, rows, allow_optimize=allow_optimize) def _write_nodes(self, node_iterator, allow_optimize=True): """Write node_iterator out as a B+Tree. :param node_iterator: An iterator of sorted nodes. Each node should match the output given by iter_all_entries. :param allow_optimize: If set to False, prevent setting the optimize flag when writing out. This is used by the _spill_mem_keys_to_disk functionality. :return: A (file, size) tuple: a file-like object containing a B+Tree for the nodes, and its size in bytes. """ # The index rows - rows[0] is the root, rows[1] is the layer under it # etc. rows = [] # forward sorted by key. In future we may consider topological sorting, # at the cost of table scans for direct lookup, or a second index for # direct lookup key_count = 0 # A stack with the number of nodes of each size. 0 is the root node # and must always be 1 (if there are any nodes in the tree). self.row_lengths = [] # Loop over all nodes adding them to the bottom row # (rows[-1]). When we finish a chunk in a row, # propagate the key that didn't fit (comes after the chunk) to the # row above, transitively.
for node in node_iterator: if key_count == 0: # First key triggers the first row rows.append(_LeafBuilderRow()) key_count += 1 string_key, line = _btree_serializer._flatten_node( node, self.reference_lists ) self._add_key(string_key, line, rows, allow_optimize=allow_optimize) for row in reversed(rows): pad = not isinstance(row, _LeafBuilderRow) row.finish_node(pad=pad) lines = [_BTSIGNATURE] lines.append(b"%s%d\n" % (_OPTION_NODE_REFS, self.reference_lists)) lines.append(b"%s%d\n" % (_OPTION_KEY_ELEMENTS, self._key_length)) lines.append(b"%s%d\n" % (_OPTION_LEN, key_count)) row_lengths = [row.nodes for row in rows] lines.append( _OPTION_ROW_LENGTHS + ",".join(map(str, row_lengths)).encode("ascii") + b"\n" ) if row_lengths and row_lengths[-1] > 1: result = tempfile.NamedTemporaryFile(prefix="bzr-index-") else: result = BytesIO() result.writelines(lines) position = sum(map(len, lines)) if position > _RESERVED_HEADER_BYTES: raise AssertionError( "Could not fit the header in the" " reserved space: %d > %d" % (position, _RESERVED_HEADER_BYTES) ) # write the rows out: for row in rows: reserved = _RESERVED_HEADER_BYTES # reserved space for first node row.spool.flush() row.spool.seek(0) # copy nodes to the finalised file. # Special case the first node as it may be prefixed node = row.spool.read(_PAGE_SIZE) result.write(node[reserved:]) if len(node) == _PAGE_SIZE: result.write(b"\x00" * (reserved - position)) position = 0 # Only the root row actually has an offset copied_len = osutils.pumpfile(row.spool, result) if copied_len != (row.nodes - 1) * _PAGE_SIZE: if not isinstance(row, _LeafBuilderRow): raise AssertionError( "Incorrect amount of data copied" " expected: %d, got: %d" % ((row.nodes - 1) * _PAGE_SIZE, copied_len) ) result.flush() size = result.tell() result.seek(0) return result, size def finish(self): """Finalise the index. :return: A file handle for a temporary file containing the nodes added to the index. 
""" return self._write_nodes(self.iter_all_entries())[0] def iter_all_entries(self): """Iterate over all keys within the index. :return: An iterable of (index, key, value, reference_lists). There is no defined order for the result iteration - it will be in the most efficient order for the index (in this case dictionary hash order). """ evil_logger.debug("iter_all_entries scales with size of history.") # Doing serial rather than ordered would be faster; but this shouldn't # be getting called routinely anyway. iterators = [self._iter_mem_nodes()] for backing in self._backing_indices: if backing is not None: iterators.append(backing.iter_all_entries()) if len(iterators) == 1: return iterators[0] return self._iter_smallest(iterators) def iter_entries(self, keys): """Iterate over keys within the index. :param keys: An iterable providing the keys to be retrieved. :return: An iterable of (index, key, value, reference_lists). There is no defined order for the result iteration - it will be in the most efficient order for the index (keys iteration order in this case). """ keys = set(keys) # Note: We don't use keys.intersection() here. If you read the C api, # set.intersection(other) special cases when other is a set and # will iterate the smaller of the two and lookup in the other. # It does *not* do this for any other type (even dict, unlike # some other set functions.) Since we expect keys is generally << # self._nodes, it is faster to iterate over it in a list # comprehension nodes = self._nodes local_keys = [key for key in keys if key in nodes] if self.reference_lists: for key in local_keys: node = nodes[key] yield self, key, node[1], node[0] else: for key in local_keys: node = nodes[key] yield self, key, node[1] # Find things that are in backing indices that have not been handled # yet. 
if not self._backing_indices: return # We won't find anything there either # Remove all of the keys that we found locally keys.difference_update(local_keys) for backing in self._backing_indices: if backing is None: continue if not keys: return for node in backing.iter_entries(keys): keys.remove(node[1]) yield (self,) + node[1:] def iter_entries_prefix(self, keys): """Iterate over keys within the index using prefix matching. Prefix matching is applied within the tuple of a key, not within the bytestring of each key element. e.g. if you have the keys ('foo', 'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then only the former key is returned. :param keys: An iterable providing the key prefixes to be retrieved. Each key prefix takes the form of a tuple the length of a key, but with the last N elements 'None' rather than a regular bytestring. The first element cannot be 'None'. :return: An iterable as per iter_all_entries, but restricted to the keys with a matching prefix to those supplied. No additional keys will be returned, and every match that is in the index will be returned.
""" keys = set(keys) if not keys: return for backing in self._backing_indices: if backing is None: continue for node in backing.iter_entries_prefix(keys): yield (self,) + node[1:] if self._key_length == 1: for key in keys: _mod_index._sanity_check_key(self, key) try: node = self._nodes[key] except KeyError: continue if self.reference_lists: yield self, key, node[1], node[0] else: yield self, key, node[1] return nodes_by_key = self._get_nodes_by_key() yield from _mod_index._iter_entries_prefix(self, nodes_by_key, keys) def _get_nodes_by_key(self): if self._nodes_by_key is None: nodes_by_key = {} if self.reference_lists: for key, (references, value) in self._nodes.items(): key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value, references else: for key, (references, value) in self._nodes.items(): # noqa: B007 key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value self._nodes_by_key = nodes_by_key return self._nodes_by_key def key_count(self): """Return an estimate of the number of keys in this index. For InMemoryGraphIndex the estimate is exact. """ return len(self._nodes) + sum( backing.key_count() for backing in self._backing_indices if backing is not None ) def validate(self): """In memory index's have no known corruption at the moment.""" def __lt__(self, other): """Compare with another index for sorting.""" if isinstance(other, type(self)): return self._nodes < other._nodes # Always sort existing indexes before ones that are still being built. if isinstance(other, BTreeGraphIndex): return False raise TypeError class _LeafNode(dict): """A leaf node for a serialised B+Tree index.""" __slots__ = ("_keys", "max_key", "min_key") def __init__(self, bytes, key_length, ref_list_length): """Parse bytes to create a leaf node object.""" # splitlines mangles the \r delimiters.. don't use it. 
key_list = _btree_serializer._parse_leaf_lines( bytes, key_length, ref_list_length ) if key_list: self.min_key = key_list[0][0] self.max_key = key_list[-1][0] else: self.min_key = self.max_key = None super().__init__(key_list) self._keys = dict(self) def all_items(self): """Return a sorted list of (key, (value, refs)) items.""" items = sorted(self.items()) return items def all_keys(self): """Return a sorted list of all keys.""" keys = sorted(self.keys()) return keys class _InternalNode: """An internal node for a serialised B+Tree index.""" __slots__ = ("keys", "offset") def __init__(self, bytes): """Parse bytes to create an internal node object.""" # splitlines mangles the \r delimiters.. don't use it. self.keys = self._parse_lines(bytes.split(b"\n")) def _parse_lines(self, lines): nodes = [] self.offset = int(lines[1][7:]) for line in lines[2:]: if line == b"": break nodes.append(tuple(line.split(b"\0"))) return nodes class BTreeGraphIndex: """Access to nodes via the standard GraphIndex interface for B+Tree's. Individual nodes are held in a LRU cache. This holds the root node in memory except when very large walks are done. """ def __init__(self, transport, name, size, unlimited_cache=False, offset=0): """Create a B+Tree index object on the index name. :param transport: The transport to read data for the index from. :param name: The file name of the index on transport. :param size: Optional size of the index in bytes. This allows compatibility with the GraphIndex API, as well as ensuring that the initial read (to read the root node header) can be done without over-reading even on empty indices, and on small indices allows single-IO to read the entire index. :param unlimited_cache: If set to True, then instead of using an LRUCache with size _NODE_CACHE_SIZE, we will use a dict and always cache all leaf nodes. :param offset: The start of the btree index data isn't byte 0 of the file. Instead it starts at some point later. 
""" self._transport = transport self._name = name self._size = size self._file = None self._recommended_pages = self._compute_recommended_pages() self._root_node = None self._base_offset = offset self._leaf_factory = _LeafNode # Default max size is 100,000 leave values self._leaf_value_cache = None # lru_cache.LRUCache(100*1000) if unlimited_cache: self._leaf_node_cache = {} self._internal_node_cache = {} else: self._leaf_node_cache = lru_cache.LRUCache(_NODE_CACHE_SIZE) # We use a FIFO here just to prevent possible blowout. However, a # 300k record btree has only 3k leaf nodes, and only 20 internal # nodes. A value of 100 scales to ~100*100*100 = 1M records. self._internal_node_cache = FIFOCache(100) self._key_count = None self._row_lengths = None self._row_offsets = None # Start of each row, [-1] is the end def __hash__(self): """Return hash based on object identity.""" return id(self) def __eq__(self, other): """Equal when self and other were created with the same parameters.""" return ( isinstance(self, type(other)) and self._transport == other._transport and self._name == other._name and self._size == other._size ) def __lt__(self, other): """Compare with another index for sorting by name and size.""" if isinstance(other, type(self)): return (self._name, self._size) < (other._name, other._size) # Always sort existing indexes before ones that are still being built. if isinstance(other, BTreeBuilder): return True raise TypeError def __ne__(self, other): """Return True if not equal to other.""" return not self.__eq__(other) def _get_and_cache_nodes(self, nodes): """Read nodes and cache them in the lru. The nodes list supplied is sorted and then read from disk, each node being inserted it into the _node_cache. Note: Asking for more nodes than the _node_cache can contain will result in some of the results being immediately discarded, to prevent this an assertion is raised if more nodes are asked for than are cachable. 
:return: A dict of {node_pos: node} """ found = {} start_of_leaves = None for node_pos, node in self._read_nodes(sorted(nodes)): if node_pos == 0: # Special case self._root_node = node else: if start_of_leaves is None: start_of_leaves = self._row_offsets[-2] if node_pos < start_of_leaves: self._internal_node_cache[node_pos] = node else: self._leaf_node_cache[node_pos] = node found[node_pos] = node return found def _compute_recommended_pages(self): """Convert transport's recommended_page_size into btree pages. recommended_page_size is in bytes, we want to know how many _PAGE_SIZE pages fit in that length. """ recommended_read = self._transport.recommended_page_size() recommended_pages = math.ceil(recommended_read / _PAGE_SIZE) return recommended_pages def _compute_total_pages_in_index(self): """How many pages are in the index. If we have read the header we will use the value stored there. Otherwise it will be computed based on the length of the index. """ if self._size is None: raise AssertionError( "_compute_total_pages_in_index should not be" " called when self._size is None" ) if self._root_node is not None: # This is the number of pages as defined by the header return self._row_offsets[-1] # This is the number of pages as defined by the size of the index. They # should be identical. total_pages = math.ceil(self._size / _PAGE_SIZE) return total_pages def _expand_offsets(self, offsets): """Find extra pages to download. The idea is that we always want to make big-enough requests (like 64kB for http), so that we don't waste round trips. So given the entries that we already have cached and the new pages being downloaded figure out what other pages we might want to read. See also doc/developers/btree_index_prefetch.txt for more details.
:param offsets: The offsets to be read :return: A list of offsets to download """ logger.debug("expanding: %s\toffsets: %s", self._name, offsets) if len(offsets) >= self._recommended_pages: # Don't add more, we are already requesting more than enough logger.debug( " not expanding large request (%s >= %s)", len(offsets), self._recommended_pages, ) return offsets if self._size is None: # Don't try anything, because we don't know where the file ends logger.debug(" not expanding without knowing index size") return offsets total_pages = self._compute_total_pages_in_index() cached_offsets = self._get_offsets_to_cached_pages() # If reading recommended_pages would read the rest of the index, just # do so. if total_pages - len(cached_offsets) <= self._recommended_pages: # Read whatever is left if cached_offsets: expanded = [x for x in range(total_pages) if x not in cached_offsets] else: expanded = list(range(total_pages)) evil_logger.debug(" reading all unread pages: %s", expanded) return expanded if self._root_node is None: # ATM on the first read of the root node of a large index, we don't # bother pre-reading any other pages. This is because the # likelihood of actually reading interesting pages is very low. # See doc/developers/btree_index_prefetch.txt for a discussion, and # a possible implementation when we are guessing that the second # layer index is small final_offsets = offsets else: tree_depth = len(self._row_lengths) if len(cached_offsets) < tree_depth and len(offsets) == 1: # We haven't read enough to justify expansion # If we are only going to read the root node, and 1 leaf node, # then it isn't worth expanding our request. Once we've read at # least 2 nodes, then we are probably doing a search, and we # start expanding our requests.
logger.debug(" not expanding on first reads") return offsets final_offsets = self._expand_to_neighbors( offsets, cached_offsets, total_pages ) final_offsets = sorted(final_offsets) logger.debug("expanded: %s", final_offsets) return final_offsets def _expand_to_neighbors(self, offsets, cached_offsets, total_pages): """Expand requests to neighbors until we have enough pages. This is called from _expand_offsets after policy has determined that we want to expand. We only want to expand requests within a given layer. We cheat a little bit and assume all requests will be in the same layer. This is true given the current design, but if it changes this algorithm may perform oddly. :param offsets: requested offsets :param cached_offsets: offsets for pages we currently have cached :return: A set() of offsets after expansion """ final_offsets = set(offsets) first = end = None new_tips = set(final_offsets) while len(final_offsets) < self._recommended_pages and new_tips: next_tips = set() for pos in new_tips: if first is None: first, end = self._find_layer_first_and_end(pos) previous = pos - 1 if ( previous > 0 and previous not in cached_offsets and previous not in final_offsets and previous >= first ): next_tips.add(previous) after = pos + 1 if ( after < total_pages and after not in cached_offsets and after not in final_offsets and after < end ): next_tips.add(after) # This would keep us from going bigger than # recommended_pages by only expanding the first offsets. # However, if we are making a 'wide' request, it is # reasonable to expand all points equally. # if len(final_offsets) > recommended_pages: # break final_offsets.update(next_tips) new_tips = next_tips return final_offsets def clear_cache(self): """Clear out any cached/memoized values. This can be called at any time, but generally it is used when we have extracted some information, but don't expect to be requesting any more from this index. 
""" # Note that we don't touch self._root_node or self._internal_node_cache # We don't expect either of those to be big, and it can save # round-trips in the future. We may re-evaluate this if InternalNode # memory starts to be an issue. self._leaf_node_cache.clear() def external_references(self, ref_list_num): """Return external references from the specified reference list.""" if self._root_node is None: self._get_root_node() if ref_list_num + 1 > self.node_ref_lists: raise ValueError( "No ref list %d, index has %d ref lists" % (ref_list_num, self.node_ref_lists) ) keys = set() refs = set() for node in self.iter_all_entries(): keys.add(node[1]) refs.update(node[3][ref_list_num]) return refs - keys def _find_layer_first_and_end(self, offset): """Find the start/stop nodes for the layer corresponding to offset. :return: (first, end) first is the first node in this layer end is the first node of the next layer """ first = end = 0 for roffset in self._row_offsets: first = end end = roffset if offset < roffset: break return first, end def _get_offsets_to_cached_pages(self): """Determine what nodes we already have cached.""" cached_offsets = set(self._internal_node_cache) # cache may be dict or LRUCache, keys() is the common method cached_offsets.update(self._leaf_node_cache.keys()) if self._root_node is not None: cached_offsets.add(0) return cached_offsets def _get_root_node(self): if self._root_node is None: # We may not have a root node yet self._get_internal_nodes([0]) return self._root_node def _get_nodes(self, cache, node_indexes): found = {} needed = [] for idx in node_indexes: if idx == 0 and self._root_node is not None: found[0] = self._root_node continue try: found[idx] = cache[idx] except KeyError: needed.append(idx) if not needed: return found needed = self._expand_offsets(needed) found.update(self._get_and_cache_nodes(needed)) return found def _get_internal_nodes(self, node_indexes): """Get a node, from cache or disk. 
After getting it, the node will be cached. """ return self._get_nodes(self._internal_node_cache, node_indexes) def _cache_leaf_values(self, nodes): """Cache directly from key => value, skipping the btree.""" if self._leaf_value_cache is not None: for node in nodes.values(): for key, value in node.all_items(): if key in self._leaf_value_cache: # Don't add the rest of the keys, we've seen this node # before. break self._leaf_value_cache[key] = value def _get_leaf_nodes(self, node_indexes): """Get a bunch of nodes, from cache or disk.""" found = self._get_nodes(self._leaf_node_cache, node_indexes) self._cache_leaf_values(found) return found def iter_all_entries(self): """Iterate over all keys within the index. :return: An iterable of (index, key, value) or (index, key, value, reference_lists). The former tuple is used when there are no reference lists in the index, making the API compatible with simple key:value index types. There is no defined order for the result iteration - it will be in the most efficient order for the index. """ evil_logger.debug("iter_all_entries scales with size of history.") if not self.key_count(): return if self._row_offsets[-1] == 1: # There is only the root node, and we read that via key_count() if self.node_ref_lists: for key, (value, refs) in self._root_node.all_items(): yield (self, key, value, refs) else: for key, (value, refs) in self._root_node.all_items(): # noqa: B007 yield (self, key, value) return start_of_leaves = self._row_offsets[-2] end_of_leaves = self._row_offsets[-1] needed_offsets = list(range(start_of_leaves, end_of_leaves)) if needed_offsets == [0]: # Special case when we only have a root node, as we have already # read everything nodes = [(0, self._root_node)] else: nodes = self._read_nodes(needed_offsets) # We iterate strictly in-order so that we can use this function # for spilling index builds to disk. 
if self.node_ref_lists: for _, node in nodes: for key, (value, refs) in node.all_items(): yield (self, key, value, refs) else: for _, node in nodes: for key, (value, refs) in node.all_items(): # noqa: B007 yield (self, key, value) @staticmethod def _multi_bisect_right(in_keys, fixed_keys): """Find the positions where each 'in_key' would fit in fixed_keys. This is equivalent to doing "bisect_right" on each in_key into fixed_keys :param in_keys: A sorted list of keys to match with fixed_keys :param fixed_keys: A sorted list of keys to match against :return: A list of (integer position, [key list]) tuples. """ import bisect if not in_keys: return [] if not fixed_keys: # no pointers in the fixed_keys list, which means everything must # fall to the left. return [(0, in_keys)] # TODO: Iterating both lists will generally take M + N steps # Bisecting each key will generally take M * log2 N steps. # If we had an efficient way to compare, we could pick the method # based on which has the fewer number of steps. # There is also the argument that bisect_right is a compiled # function, so there is even more to be gained. # iter_steps = len(in_keys) + len(fixed_keys) # bisect_steps = len(in_keys) * math.log(len(fixed_keys), 2) if len(in_keys) == 1: # Bisect will always be faster for M = 1 return [(bisect.bisect_right(fixed_keys, in_keys[0]), in_keys)] # elif bisect_steps < iter_steps: # offsets = {} # for key in in_keys: # offsets.setdefault(bisect_right(fixed_keys, key), # []).append(key) # return [(o, offsets[o]) for o in sorted(offsets)] in_keys_iter = iter(in_keys) fixed_keys_iter = enumerate(fixed_keys) cur_in_key = next(in_keys_iter) cur_fixed_offset, cur_fixed_key = next(fixed_keys_iter) class InputDone(Exception): pass class FixedDone(Exception): pass output = [] cur_out = [] # TODO: Another possibility is that rather than iterating on each side, # we could use a combination of bisecting and iterating. 
For # example, while cur_in_key < fixed_key, bisect to find its # point, then iterate all matching keys, then bisect (restricted # to only the remainder) for the next one, etc. try: while True: if cur_in_key < cur_fixed_key: cur_keys = [] cur_out = (cur_fixed_offset, cur_keys) output.append(cur_out) while cur_in_key < cur_fixed_key: cur_keys.append(cur_in_key) try: cur_in_key = next(in_keys_iter) except StopIteration as exc: raise InputDone from exc # At this point cur_in_key must be >= cur_fixed_key # step the cur_fixed_key until we pass the cur key, or walk off # the end while cur_in_key >= cur_fixed_key: try: cur_fixed_offset, cur_fixed_key = next(fixed_keys_iter) except StopIteration as exc: raise FixedDone from exc except InputDone: # We consumed all of the input, nothing more to do pass except FixedDone: # There was some input left, but we consumed all of fixed, so we # have to add one more for the tail cur_keys = [cur_in_key] cur_keys.extend(in_keys_iter) cur_out = (len(fixed_keys), cur_keys) output.append(cur_out) return output def _walk_through_internal_nodes(self, keys): """Take the given set of keys, and find the corresponding LeafNodes. :param keys: An unsorted iterable of keys to search for :return: (nodes, index_and_keys) nodes is a dict mapping {index: LeafNode} keys_at_index is a list of tuples of [(index, [keys for Leaf])] """ # 6 seconds spent in miss_torture using the sorted() line. # Even with out of order disk IO it seems faster not to sort it when # large queries are being made. 
keys_at_index = [(0, sorted(keys))] for _row_pos, next_row_start in enumerate(self._row_offsets[1:-1]): node_indexes = [idx for idx, s_keys in keys_at_index] nodes = self._get_internal_nodes(node_indexes) next_nodes_and_keys = [] for node_index, sub_keys in keys_at_index: node = nodes[node_index] positions = self._multi_bisect_right(sub_keys, node.keys) node_offset = next_row_start + node.offset next_nodes_and_keys.extend( [(node_offset + pos, s_keys) for pos, s_keys in positions] ) keys_at_index = next_nodes_and_keys # We should now be at the _LeafNodes node_indexes = [idx for idx, s_keys in keys_at_index] # TODO: We may *not* want to always read all the nodes in one # big go. Consider setting a max size on this. nodes = self._get_leaf_nodes(node_indexes) return nodes, keys_at_index def iter_entries(self, keys): """Iterate over keys within the index. :param keys: An iterable providing the keys to be retrieved. :return: An iterable as per iter_all_entries, but restricted to the keys supplied. No additional keys will be returned, and every key supplied that is in the index will be returned. """ # 6 seconds spent in miss_torture using the sorted() line. # Even with out of order disk IO it seems faster not to sort it when # large queries are being made. # However, now that we are doing multi-way bisecting, we need the keys # in sorted order anyway. We could change the multi-way code to not # require sorted order. (For example, it bisects for the first node, # does an in-order search until a key comes before the current point, # which it then bisects for, etc.) 
keys = frozenset(keys) if not keys: return if not self.key_count(): return needed_keys = [] if self._leaf_value_cache is None: needed_keys = keys else: for key in keys: value = self._leaf_value_cache.get(key, None) if value is not None: # This key's value is cached, so yield it without reading a node value, refs = value if self.node_ref_lists: yield (self, key, value, refs) else: yield (self, key, value) else: needed_keys.append(key) needed_keys = keys if not needed_keys: return nodes, nodes_and_keys = self._walk_through_internal_nodes(needed_keys) for node_index, sub_keys in nodes_and_keys: if not sub_keys: continue node = nodes[node_index] for next_sub_key in sub_keys: if next_sub_key in node: value, refs = node[next_sub_key] if self.node_ref_lists: yield (self, next_sub_key, value, refs) else: yield (self, next_sub_key, value) def _find_ancestors(self, keys, ref_list_num, parent_map, missing_keys): """Find the parent_map information for the set of keys. This populates the parent_map dict and missing_keys set based on the queried keys. It also can fill out an arbitrary number of parents that it finds while searching for the supplied keys. It is unlikely that you want to call this directly. See "CombinedGraphIndex.find_ancestry()" for a more appropriate API. :param keys: Keys whose ancestry we want to return. Every key will either end up in 'parent_map' or 'missing_keys'. :param ref_list_num: This index in the ref_lists is the parents we care about. :param parent_map: {key: parent_keys} for keys that are present in this index. This may contain more entries than were in 'keys', that are reachable ancestors of the keys requested. :param missing_keys: keys which are known to be missing in this index. This may include parents that were not directly requested, but we were able to determine that they are not present in this index. :return: search_keys parents that were found but not queried to know if they are missing or present.
Callers can re-query this index for those keys, and they will be placed into parent_map or missing_keys """ if not self.key_count(): # We use key_count() to trigger reading the root node and # determining info about this BTreeGraphIndex # If we don't have any keys, then everything is missing missing_keys.update(keys) return set() if ref_list_num >= self.node_ref_lists: raise ValueError( "No ref list %d, index has %d ref lists" % (ref_list_num, self.node_ref_lists) ) # The main trick we are trying to accomplish is that when we find a # key listing its parents, we expect that the parent key is also likely # to sit on the same page. Allowing us to expand parents quickly # without suffering the full stack of bisecting, etc. nodes, nodes_and_keys = self._walk_through_internal_nodes(keys) # These are parent keys which could not be immediately resolved on the # page where the child was present. Note that we may already be # searching for that key, and it may actually be present [or known # missing] on one of the other pages we are reading. # TODO: # We could try searching for them in the immediate previous or next # page. If they occur "later" we could put them in a pending lookup # set, and then for each node we read thereafter we could check to # see if they are present. # However, we don't know the impact of keeping this list of things # that I'm going to search for every node I come across from here on # out. # It doesn't handle the case when the parent key is missing on a # page that we *don't* read. So we already have to handle being # re-entrant for that. # Since most keys contain a date string, they are more likely to be # found earlier in the file than later, but we would know that right # away (key < min_key), and wouldn't keep searching it on every other # page that we read. # Mostly, it is an idea, one which should be benchmarked. 
parents_not_on_page = set() for node_index, sub_keys in nodes_and_keys: if not sub_keys: continue # sub_keys is all of the keys we are looking for that should exist # on this page, if they aren't here, then they won't be found node = nodes[node_index] parents_to_check = set() for next_sub_key in sub_keys: if next_sub_key not in node: # This one is just not present in the index at all missing_keys.add(next_sub_key) else: _value, refs = node[next_sub_key] parent_keys = refs[ref_list_num] parent_map[next_sub_key] = parent_keys parents_to_check.update(parent_keys) # Don't look for things we've already found parents_to_check = parents_to_check.difference(parent_map) # this can be used to test the benefit of having the check loop # inlined. # parents_not_on_page.update(parents_to_check) # continue while parents_to_check: next_parents_to_check = set() for key in parents_to_check: if key in node: _value, refs = node[key] parent_keys = refs[ref_list_num] parent_map[key] = parent_keys next_parents_to_check.update(parent_keys) else: # This parent either is genuinely missing, or should be # found on another page. Perf test whether it is better # to check if this node should fit on this page or not. # in the 'everything-in-one-pack' scenario, this *not* # doing the check is 237ms vs 243ms. # So slightly better, but I assume the standard 'lots # of packs' is going to show a reasonable improvement # from the check, because it avoids 'going around # again' for everything that is in another index # parents_not_on_page.add(key) # Missing for some reason if key < node.min_key: # in the case of bzr.dev, 3.4k/5.3k misses are # 'earlier' misses (65%) parents_not_on_page.add(key) elif key > node.max_key: # This parent key would be present on a different # LeafNode parents_not_on_page.add(key) else: # assert (key != node.min_key and # key != node.max_key) # If it was going to be present, it would be on # *this* page, so mark it missing. 
missing_keys.add(key) parents_to_check = next_parents_to_check.difference(parent_map) # Might want to do another .difference() from missing_keys # parents_not_on_page could have been found on a different page, or be # known to be missing. So cull out everything that has already been # found. search_keys = parents_not_on_page.difference(parent_map).difference( missing_keys ) return search_keys def iter_entries_prefix(self, keys): """Iterate over keys within the index using prefix matching. Prefix matching is applied within the tuple of a key, not within the bytestring of each key element. e.g. if you have the keys ('foo', 'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then only the former key is returned. WARNING: Note that this method currently causes a full index parse unconditionally (which is reasonably appropriate as it is a means for thunking many small indices into one larger one and still supplies iter_all_entries at the thunk layer). :param keys: An iterable providing the key prefixes to be retrieved. Each key prefix takes the form of a tuple the length of a key, but with the last N elements 'None' rather than a regular bytestring. The first element cannot be 'None'. :return: An iterable as per iter_all_entries, but restricted to the keys with a matching prefix to those supplied. No additional keys will be returned, and every match that is in the index will be returned. """ keys = sorted(set(keys)) if not keys: return # Load if needed to check key lengths if self._key_count is None: self._get_root_node() # TODO: only access nodes that can satisfy the prefixes we are looking # for. For now, to meet API usage (as this function is not used by # current breezy) just suck the entire index and iterate in memory.
nodes = {} if self.node_ref_lists: if self._key_length == 1: for _1, key, value, refs in self.iter_all_entries(): nodes[key] = value, refs else: nodes_by_key = {} for _1, key, value, refs in self.iter_all_entries(): key_value = key, value, refs # For a key of (foo, bar, baz) create # _nodes_by_key[foo][bar][baz] = key_value key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key_value else: if self._key_length == 1: for _1, key, value in self.iter_all_entries(): nodes[key] = value else: nodes_by_key = {} for _1, key, value in self.iter_all_entries(): key_value = key, value # For a key of (foo, bar, baz) create # _nodes_by_key[foo][bar][baz] = key_value key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key_value if self._key_length == 1: for key in keys: _mod_index._sanity_check_key(self, key) try: if self.node_ref_lists: value, node_refs = nodes[key] yield self, key, value, node_refs else: yield self, key, nodes[key] except KeyError: pass return yield from _mod_index._iter_entries_prefix(self, nodes_by_key, keys) def key_count(self): """Return an estimate of the number of keys in this index. For BTreeGraphIndex the estimate is exact as it is contained in the header. """ if self._key_count is None: self._get_root_node() return self._key_count def _compute_row_offsets(self): """Fill out the _row_offsets attribute based on _row_lengths.""" offsets = [] row_offset = 0 for row in self._row_lengths: offsets.append(row_offset) row_offset += row offsets.append(row_offset) self._row_offsets = offsets def _parse_header_from_bytes(self, bytes): """Parse the header from a region of bytes. :param bytes: The data to parse. :return: An offset, data tuple such as readv yields, for the unparsed data. (which may be of length 0). 
""" signature = bytes[0 : len(self._signature())] if not signature == self._signature(): raise _mod_index.BadIndexFormatSignature(self._name, BTreeGraphIndex) lines = bytes[len(self._signature()) :].splitlines() options_line = lines[0] if not options_line.startswith(_OPTION_NODE_REFS): raise _mod_index.BadIndexOptions(self) try: self.node_ref_lists = int(options_line[len(_OPTION_NODE_REFS) :]) except ValueError as e: raise _mod_index.BadIndexOptions(self) from e options_line = lines[1] if not options_line.startswith(_OPTION_KEY_ELEMENTS): raise _mod_index.BadIndexOptions(self) try: self._key_length = int(options_line[len(_OPTION_KEY_ELEMENTS) :]) except ValueError as e: raise _mod_index.BadIndexOptions(self) from e options_line = lines[2] if not options_line.startswith(_OPTION_LEN): raise _mod_index.BadIndexOptions(self) try: self._key_count = int(options_line[len(_OPTION_LEN) :]) except ValueError as e: raise _mod_index.BadIndexOptions(self) from e options_line = lines[3] if not options_line.startswith(_OPTION_ROW_LENGTHS): raise _mod_index.BadIndexOptions(self) try: self._row_lengths = [ int(length) for length in options_line[len(_OPTION_ROW_LENGTHS) :].split(b",") if length ] except ValueError as e: raise _mod_index.BadIndexOptions(self) from e self._compute_row_offsets() # calculate the bytes we have processed header_end = len(signature) + sum(map(len, lines[0:4])) + 4 return header_end, bytes[header_end:] def _read_nodes(self, nodes): """Read some nodes from disk into the LRU cache. This performs a readv to get the node data into memory, and parses each node, then yields it to the caller. The nodes are requested in the supplied order. If possible doing sort() on the list before requesting a read may improve performance. :param nodes: The nodes to read. 0 - first node, 1 - second node etc. 
:return: A generator of (node index, node) tuples. """ # may be the byte string of the whole file bytes = None # list of (offset, length) regions of the file that should, eventually # be read into data_ranges, either from 'bytes' or from the transport ranges = [] base_offset = self._base_offset for index in nodes: offset = index * _PAGE_SIZE size = _PAGE_SIZE if index == 0: # Root node - special case if self._size: size = min(_PAGE_SIZE, self._size) else: # The only case where we don't know the size, is for very # small indexes. So we read the whole thing bytes = self._transport.get_bytes(self._name) num_bytes = len(bytes) self._size = num_bytes - base_offset # the whole thing should be parsed out of 'bytes' ranges = [ (start, min(_PAGE_SIZE, num_bytes - start)) for start in range(base_offset, num_bytes, _PAGE_SIZE) ] break else: if offset > self._size: raise AssertionError( "tried to read past the end" f" of the file {offset} > {self._size}" ) size = min(size, self._size - offset) ranges.append((base_offset + offset, size)) if not ranges: return elif bytes is not None: # already have the whole file data_ranges = [ (start, bytes[start : start + size]) for start, size in ranges ] elif self._file is None: data_ranges = self._transport.readv(self._name, ranges) else: data_ranges = [] for offset, size in ranges: self._file.seek(offset) data_ranges.append((offset, self._file.read(size))) for offset, data in data_ranges: offset -= base_offset if offset == 0: # extract the header offset, data = self._parse_header_from_bytes(data) if len(data) == 0: continue bytes = zlib.decompress(data) if bytes.startswith(_LEAF_FLAG): node = self._leaf_factory(bytes, self._key_length, self.node_ref_lists) elif bytes.startswith(_INTERNAL_FLAG): node = _InternalNode(bytes) else: raise AssertionError(f"Unknown node type for {bytes!r}") yield offset // _PAGE_SIZE, node def _signature(self): """The file signature for this index type.""" return _BTSIGNATURE def validate(self): """Validate that everything in the index can be
accessed.""" # just read and parse every node. self._get_root_node() if len(self._row_lengths) > 1: start_node = self._row_offsets[1] else: # We shouldn't be reading anything anyway start_node = 1 node_end = self._row_offsets[-1] for _node in self._read_nodes(list(range(start_node, node_end))): pass _gcchk_factory = _LeafNode try: from . import _btree_serializer_pyx as _btree_serializer # type: ignore _gcchk_factory = _btree_serializer._parse_into_chk # type: ignore except ModuleNotFoundError as e: osutils.failed_to_load_extension(e) from bzrformats import _btree_serializer_py as _btree_serializer bzrformats_3.4.0.orig/bzrformats/chk_map.py0000644000000000000000000024611415162115103016026 0ustar00# Copyright (C) 2008-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA r"""Persistent maps from tuple_of_strings->string using CHK stores. Overview and current status: The CHKMap class implements a dict from tuple_of_strings->string by using a trie with internal nodes of 8-bit fan out; The key tuples are mapped to strings by joining them by \x00, and \x00 padding shorter keys out to the length of the longest key. Leaf nodes are packed as densely as possible, and internal nodes are all an additional 8-bits wide leading to a sparse upper tree. 
Updates to a CHKMap are done preferentially via the apply_delta method, to allow optimisation of the update operation; but individual map/unmap calls are possible and supported. Individual changes via map/unmap are buffered in memory until the _save method is called to force serialisation of the tree. apply_delta records its changes immediately by performing an implicit _save. Todo: ----- Densely packed upper nodes. """ import heapq import logging import threading from collections.abc import Callable, Generator, Iterator from typing import Union from . import lru_cache, osutils from ._bzr_rs import chk_map as _chk_map_rs from .errors import NoSuchRevision from .registry import Registry logger = logging.getLogger("bzrformats.chk_map") common_prefix_many = _chk_map_rs.common_prefix_many common_prefix_pair = _chk_map_rs.common_prefix_pair # approx 4MB # If each line is 50 bytes, and you have 255 internal pages, with 255-way fan # out, it takes 3.1MB to cache the layer. _PAGE_CACHE_SIZE = 4 * 1024 * 1024 Key = tuple[bytes, ...] SerialisedKey = bytes SearchKeyFunc = Callable[[Key], bytes] KeyFilter = list[Key] # Per thread caches for 2 reasons: # - in the server we may be serving very different content, so we get less # cache thrashing. # - we avoid locking on every cache lookup. _thread_caches = threading.local() # The page cache. _thread_caches.page_cache = None def _get_cache(): """Get the per-thread page cache. We need a function to do this because in a new thread the _thread_caches threading.local object does not have the cache initialized yet. 
""" page_cache = getattr(_thread_caches, "page_cache", None) if page_cache is None: # We are caching bytes so len(value) is perfectly accurate page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE) _thread_caches.page_cache = page_cache return page_cache def clear_cache(): """Clear the CHK map page cache.""" _get_cache().clear() # If a ChildNode falls below this many bytes, we check for a remap _INTERESTING_NEW_SIZE = 50 # If a ChildNode shrinks by more than this amount, we check for a remap _INTERESTING_SHRINKAGE_LIMIT = 20 def _search_key_plain(key: Key) -> SerialisedKey: """Map the key tuple into a search string that just uses the key bytes.""" return b"\x00".join(key) search_key_registry = Registry[bytes, Callable[[Key], SerialisedKey], None]() search_key_registry.register(b"plain", _search_key_plain) def _deserialise_leaf_node(data, key, search_key_func=None): """Deserialise bytes, with key key, into a LeafNode. :param bytes: The bytes of the node. :param key: The key that the serialised node has. """ result = LeafNode(search_key_func=search_key_func) # Splitlines can split on '\r' so don't use it, split('\n') adds an # extra '' if the bytes ends in a final newline. 
lines = data.split(b"\n") trailing = lines.pop() if trailing != b"": raise AssertionError(f"We did not have a final newline for {key}") items = {} if lines[0] != b"chkleaf:": raise ValueError(f"not a serialised leaf node: {data!r}") maximum_size = int(lines[1]) width = int(lines[2]) length = int(lines[3]) prefix = lines[4] pos = 5 while pos < len(lines): line = prefix + lines[pos] elements = line.split(b"\x00") pos += 1 if len(elements) != width + 1: raise AssertionError( "Incorrect number of elements (%d vs %d) for: %r" % (len(elements), width + 1, line) ) num_value_lines = int(elements[-1]) value_lines = lines[pos : pos + num_value_lines] pos += num_value_lines value = b"\n".join(value_lines) items[tuple(elements[:-1])] = value if len(items) != length: raise AssertionError( "item count (%d) mismatch for key %s, bytes %r" % (length, key, data) ) result._items = items result._len = length result._maximum_size = maximum_size result._key = key result._key_width = width result._raw_size = ( sum(map(len, lines[5:])) # the length of the suffix + (length) * (len(prefix)) + (len(lines) - 5) ) if not items: result._search_prefix = None result._common_serialised_prefix = None else: result._search_prefix = _unknown result._common_serialised_prefix = prefix if len(data) != result._current_size(): raise AssertionError("_current_size computed incorrectly") return result def _deserialise_internal_node(data, key, search_key_func=None): result = InternalNode(search_key_func=search_key_func) # Splitlines can split on '\r' so don't use it, remove the extra '' # from the result of split('\n') because we should have a trailing # newline lines = data.split(b"\n") if lines[-1] != b"": raise ValueError("last line must be ''") lines.pop(-1) items = {} if lines[0] != b"chknode:": raise ValueError(f"not a serialised internal node: {data!r}") maximum_size = int(lines[1]) width = int(lines[2]) length = int(lines[3]) common_prefix = lines[4] for line in lines[5:]: line = common_prefix +
line prefix, flat_key = line.rsplit(b"\x00", 1) items[prefix] = (flat_key,) if len(items) == 0: raise AssertionError(f"We didn't find any item for {key}") result._items = items result._len = length result._maximum_size = maximum_size result._key = key result._key_width = width # XXX: InternalNodes don't really care about their size, and this will # change if we add prefix compression result._raw_size = None # len(bytes) result._node_width = len(prefix) result._search_prefix = common_prefix return result class CHKMap: """A persistent map from string to string backed by a CHK store.""" __slots__ = ("_root_node", "_search_key_func", "_store") _root_node: Union["Node", Key] def __init__( self, store, root_key: Key | None, search_key_func: SearchKeyFunc | None = None, ): """Create a CHKMap object. :param store: The store the CHKMap is stored in. :param root_key: The root key of the map. None to create an empty CHKMap. :param search_key_func: A function mapping a key => bytes. These bytes are then used by the internal nodes to split up leaf nodes into multiple pages. """ self._store = store if search_key_func is None: search_key_func = _search_key_plain self._search_key_func = search_key_func if root_key is None: self._root_node = LeafNode(search_key_func=search_key_func) else: self._root_node = self._node_key(root_key) def apply_delta(self, delta): """Apply a delta to the map. :param delta: An iterable of old_key, new_key, new_value tuples. If new_key is not None, then new_key->new_value is inserted into the map; if old_key is not None, then the old mapping of old_key is removed. """ has_deletes = False # Check preconditions first. new_items = { tuple(key) for (old, key, value) in delta if key is not None and old is None } existing_new = list(self.iteritems(key_filter=new_items)) if existing_new: from .errors import InconsistentDeltaDelta raise InconsistentDeltaDelta( delta, f"New items are already in the map {existing_new!r}." ) # Now apply changes. 
for old, new, _value in delta: if old is not None and old != new: self.unmap(old, check_remap=False) has_deletes = True for _old, new, value in delta: if new is not None: self.map(new, value) if has_deletes: self._check_remap() return self._save() def _ensure_root(self) -> None: """Ensure that the root node is an object not a key.""" if isinstance(self._root_node, tuple): # Demand-load the root self._root_node = self._get_node(self._root_node) def _get_node(self, node: Union[Key, "Node"]) -> "Node": """Get a node. Note that this does not update the _items dict in objects containing a reference to this node. As such it does not prevent subsequent IO being performed. :param node: A tuple key or node object. :return: A node object. """ if isinstance(node, tuple): bytes = self._read_bytes(node) return _deserialise(bytes, node, search_key_func=self._search_key_func) else: return node def _read_bytes(self, key: Key) -> bytes: try: return _get_cache()[key] except KeyError: stream = self._store.get_record_stream([key], "unordered", True) bytes = next(stream).get_bytes_as("fulltext") _get_cache()[key] = bytes return bytes def _dump_tree(self, include_keys=False, encoding="utf-8"): """Return the tree in a string representation.""" self._ensure_root() def decode(x): return x.decode(encoding) res = self._dump_tree_node( self._root_node, prefix=b"", indent="", decode=decode, include_keys=include_keys, ) res.append("") # Give a trailing '\n' return "\n".join(res) def _dump_tree_node( self, node: "Node", prefix, indent, decode, include_keys: bool = True ) -> list[str]: """For this node and all children, generate a string representation.""" result = [] if not include_keys: key_str = "" else: node_key = node.key() key_str = f" {decode(node_key[0])}" if node_key is not None
else " None" result.append(f"{indent}{decode(prefix)!r} {node.__class__.__name__}{key_str}") if isinstance(node, InternalNode): # Trigger all child nodes to get loaded list(node._iter_nodes(self._store)) for prefix, sub in sorted(node._items.items()): result.extend( self._dump_tree_node( sub, prefix, indent + " ", decode=decode, include_keys=include_keys, ) ) else: for key, value in sorted(node._items.items()): # Don't use prefix nor indent here to line up when used in # tests in conjunction with assertEqualDiff result.append( f" {tuple([decode(ke) for ke in key])!r} {decode(value)!r}" ) return result @classmethod def from_dict( cls, store, initial_value, maximum_size: int = 0, key_width: int = 1, search_key_func: SearchKeyFunc | None = None, ): """Create a CHKMap in store with initial_value as the content. :param store: The store to record initial_value in, a VersionedFiles object with 1-tuple keys supporting CHK key generation. :param initial_value: A dict to store in store. Its keys and values must be bytestrings. :param maximum_size: The maximum_size rule to apply to nodes. This determines the size at which no new data is added to a single node. :param key_width: The number of elements in each key_tuple being stored in this map. :param search_key_func: A function mapping a key => bytes. These bytes are then used by the internal nodes to split up leaf nodes into multiple pages. :return: The root chk of the resulting CHKMap. 
""" root_key = cls._create_directly( store, initial_value, maximum_size=maximum_size, key_width=key_width, search_key_func=search_key_func, ) if not isinstance(root_key, tuple): raise AssertionError(f"we got a {type(root_key)} instead of a tuple") return root_key @classmethod def _create_via_map( cls, store, initial_value, maximum_size: int = 0, key_width: int = 1, search_key_func: SearchKeyFunc | None = None, ): result = cls(store, None, search_key_func=search_key_func) # root_key=None means _root_node is a LeafNode, not a tuple if not isinstance(result._root_node, Node): raise AssertionError("expected root node to be Node") result._root_node.set_maximum_size(maximum_size) result._root_node._key_width = key_width delta = [] for key, value in initial_value.items(): delta.append((None, key, value)) root_key = result.apply_delta(delta) return root_key @classmethod def _create_directly( cls, store, initial_value, maximum_size: int = 0, key_width: int = 1, search_key_func: SearchKeyFunc | None = None, ): leaf_node = LeafNode(search_key_func=search_key_func) leaf_node.set_maximum_size(maximum_size) leaf_node._key_width = key_width leaf_node._items = {tuple(key): val for key, val in initial_value.items()} leaf_node._raw_size = sum( leaf_node._key_value_len(key, value) for key, value in leaf_node._items.items() ) leaf_node._len = len(leaf_node._items) leaf_node._compute_search_prefix() leaf_node._compute_serialised_prefix() node: LeafNode | InternalNode if ( leaf_node._len > 1 and maximum_size and leaf_node._current_size() > maximum_size ): prefix, node_details = leaf_node._split(store) if len(node_details) == 1: raise AssertionError("Failed to split using node._split") internal_node = InternalNode(prefix, search_key_func=search_key_func) internal_node.set_maximum_size(maximum_size) internal_node._key_width = key_width for split, subnode in node_details: internal_node.add_node(split, subnode) node = internal_node else: node = leaf_node keys = list(node.serialise(store)) 
return keys[-1] def iter_changes(self, basis): """Iterate over the changes between basis and self. :return: An iterator of tuples: (key, old_value, new_value). Old_value is None for keys only in self; new_value is None for keys only in basis. """ # Overview: # Read both trees in lexicographic, highest-first order. # Any identical nodes we skip # Any unique prefixes we output immediately. # values in a leaf node are treated as single-value nodes in the tree # which allows them to be not-special-cased. We know to output them # because their value is a string, not a key(tuple) or node. # # corner cases to beware of when considering this function: # *) common references are at different heights. # consider two trees: # {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}} # {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, # 'ab':LeafNode={'ab':'bar'}} # 'b': LeafNode={'b'}} # the node with aaa/aab will only be encountered in the second tree # after reading the 'a' subtree, but it is encountered in the first # tree immediately. Variations on this may have read internal nodes # like this. We want to cut the entire pending subtree when we # realise we have a common node. For this we use a list of keys - # the path to a node - and check the entire path is clean as we # process each item. if self._node_key(self._root_node) == self._node_key(basis._root_node): return self._ensure_root() basis._ensure_root() excluded_keys = set() self_node = self._root_node basis_node = basis._root_node # A heap, each element is prefix, key (None for a child node entry), # node(tuple/NodeObject/string), key_path (a list of tuples, # tail-sharing down the tree.) 
self_pending = [] basis_pending = [] def process_node(node, path, a_map, pending): # take a node and expand it node = a_map._get_node(node) if isinstance(node, LeafNode): path = (node._key, path) for key, value in node._items.items(): # For a LeafNode, the key is a serialized_key, rather than # a search_key, but the heap is using search_keys search_key = node._search_key_func(key) heapq.heappush(pending, (search_key, key, value, path)) else: # type(node) == InternalNode path = (node._key, path) for prefix, child in node._items.items(): heapq.heappush(pending, (prefix, None, child, path)) def process_common_internal_nodes(self_node, basis_node): self_items = set(self_node._items.items()) basis_items = set(basis_node._items.items()) path = (self_node._key, None) for prefix, child in self_items - basis_items: heapq.heappush(self_pending, (prefix, None, child, path)) path = (basis_node._key, None) for prefix, child in basis_items - self_items: heapq.heappush(basis_pending, (prefix, None, child, path)) def process_common_leaf_nodes(self_node, basis_node): self_items = set(self_node._items.items()) basis_items = set(basis_node._items.items()) path = (self_node._key, None) for key, value in self_items - basis_items: prefix = self._search_key_func(key) heapq.heappush(self_pending, (prefix, key, value, path)) path = (basis_node._key, None) for key, value in basis_items - self_items: prefix = basis._search_key_func(key) heapq.heappush(basis_pending, (prefix, key, value, path)) def process_common_prefix_nodes(self_node, self_path, basis_node, basis_path): # Would it be more efficient if we could request both at the same # time? 
self_node = self._get_node(self_node) basis_node = basis._get_node(basis_node) if isinstance(self_node, InternalNode) and isinstance( basis_node, InternalNode ): # Matching internal nodes process_common_internal_nodes(self_node, basis_node) elif isinstance(self_node, LeafNode) and isinstance(basis_node, LeafNode): process_common_leaf_nodes(self_node, basis_node) else: process_node(self_node, self_path, self, self_pending) process_node(basis_node, basis_path, basis, basis_pending) process_common_prefix_nodes(self_node, None, basis_node, None) excluded_keys = set() def check_excluded(key_path): # Note that this is N^2, it depends on us trimming trees # aggressively to not become slow. # A better implementation would probably have a reverse map # back to the children of a node, and jump straight to it when # a common node is detected, then proceed to remove the already # pending children. breezy.graph has a searcher module with a # similar problem. while key_path is not None: key, key_path = key_path if key in excluded_keys: return True return False loop_counter = 0 while self_pending or basis_pending: loop_counter += 1 if not self_pending: # self is exhausted: output remainder of basis for _prefix, key, node, path in basis_pending: if check_excluded(path): continue node = basis._get_node(node) if key is not None: # a value yield (key, node, None) else: # subtree - fastpath the entire thing. for key, value in node.iteritems(basis._store): yield (key, value, None) return elif not basis_pending: # basis is exhausted: output remainder of self. for _prefix, key, node, path in self_pending: if check_excluded(path): continue node = self._get_node(node) if key is not None: # a value yield (key, None, node) else: # subtree - fastpath the entire thing. for key, value in node.iteritems(self._store): yield (key, None, value) return else: # XXX: future optimisation - yield the smaller items # immediately rather than pushing everything on/off the # heaps. 
Applies to both internal nodes and leaf nodes. if self_pending[0][0] < basis_pending[0][0]: # expand self _prefix, key, node, path = heapq.heappop(self_pending) if check_excluded(path): continue if key is not None: # a value yield (key, None, node) else: process_node(node, path, self, self_pending) continue elif self_pending[0][0] > basis_pending[0][0]: # expand basis _prefix, key, node, path = heapq.heappop(basis_pending) if check_excluded(path): continue if key is not None: # a value yield (key, node, None) else: process_node(node, path, basis, basis_pending) continue else: # common prefix: possibly expand both if self_pending[0][1] is None: # process next self read_self = True else: read_self = False if basis_pending[0][1] is None: # process next basis read_basis = True else: read_basis = False if not read_self and not read_basis: # compare a common value self_details = heapq.heappop(self_pending) basis_details = heapq.heappop(basis_pending) if self_details[2] != basis_details[2]: yield (self_details[1], basis_details[2], self_details[2]) continue # At least one side wasn't a simple value if self._node_key(self_pending[0][2]) == self._node_key( basis_pending[0][2] ): # Identical pointers, skip (and don't bother adding to # excluded, it won't turn up again). 
heapq.heappop(self_pending) heapq.heappop(basis_pending) continue # Now we need to expand this node before we can continue if read_self and read_basis: # Both sides start with the same prefix, so process # them in parallel self_prefix, _, self_node, self_path = heapq.heappop( self_pending ) basis_prefix, _, basis_node, basis_path = heapq.heappop( basis_pending ) if self_prefix != basis_prefix: raise AssertionError(f"{self_prefix!r} != {basis_prefix!r}") process_common_prefix_nodes( self_node, self_path, basis_node, basis_path ) continue if read_self: _prefix, key, node, path = heapq.heappop(self_pending) if check_excluded(path): continue process_node(node, path, self, self_pending) if read_basis: _prefix, key, node, path = heapq.heappop(basis_pending) if check_excluded(path): continue process_node(node, path, basis, basis_pending) # print loop_counter def iteritems( self, key_filter: KeyFilter | None = None ) -> Iterator[tuple[Key, bytes]]: """Iterate over the entire CHKMap's contents.""" self._ensure_root() if isinstance(self._root_node, tuple): raise AssertionError("Cannot iterate over a map with a tuple root node") if key_filter is not None: key_filter = [tuple(key) for key in key_filter] return self._root_node.iteritems(self._store, key_filter=key_filter) def key(self) -> Key: """Return the key for this map.""" if isinstance(self._root_node, tuple): return self._root_node elif isinstance(self._root_node, Node): if self._root_node is None: raise AssertionError("No root node") return self._root_node._key else: raise AssertionError( "Invalid root node type: {!r}".format(type(self._root_node)) ) def __len__(self) -> int: """Return the number of items in the CHK map.""" self._ensure_root() return len(self._root_node) def map(self, key: Key, value) -> None: """Map a key tuple to value. :param key: A key to map. :param value: The value to assign to key. """ key = tuple(key) # Need a root object. 
self._ensure_root() if isinstance(self._root_node, tuple): raise AssertionError("Cannot map a key to a tuple root node") prefix, node_details = self._root_node.map(self._store, key, value) if len(node_details) == 1: self._root_node = node_details[0][1] else: self._root_node = InternalNode( prefix, search_key_func=self._search_key_func ) self._root_node.set_maximum_size(node_details[0][1].maximum_size) self._root_node._key_width = node_details[0][1]._key_width for split, node in node_details: self._root_node.add_node(split, node) def _node_key(self, node): """Get the key for a node whether it's a tuple or node.""" if isinstance(node, tuple): return node elif isinstance(node, Node): return node._key else: raise AssertionError("Invalid node type: {!r}".format(type(node))) def unmap(self, key, check_remap=True): """Remove key from the map.""" self._ensure_root() if isinstance(self._root_node, InternalNode): unmapped = self._root_node.unmap(self._store, key, check_remap=check_remap) else: unmapped = self._root_node.unmap(self._store, key) self._root_node = unmapped def _check_remap(self) -> None: """Check if nodes can be collapsed.""" self._ensure_root() if isinstance(self._root_node, InternalNode): self._root_node = self._root_node._check_remap(self._store) def _save(self): """Save the map completely. :return: The key of the root node. """ if isinstance(self._root_node, tuple): # Already saved. return self._root_node keys = list(self._root_node.serialise(self._store)) return keys[-1] class Node: """Base class defining the protocol for CHK Map nodes. :ivar _raw_size: The total size of the serialized key:value data, before adding the header bytes, and without prefix compression. """ __slots__ = ( "_items", "_key", "_key_width", "_len", "_maximum_size", "_raw_size", "_search_key_func", "_search_prefix", ) def __init__(self, key_width=1): """Create a node. :param key_width: The width of keys for this node. 
""" self._key = None # Current number of elements self._len = 0 self._maximum_size = 0 self._key_width = key_width # current size in bytes self._raw_size = 0 # The pointers/values this node has - meaning defined by child classes. self._items = {} # The common search prefix self._search_prefix = None def __repr__(self): """Return string representation of the node.""" items_str = str(sorted(self._items)) if len(items_str) > 20: items_str = items_str[:16] + "...]" return "{}(key:{} len:{} size:{} max:{} prefix:{} items:{})".format( self.__class__.__name__, self._key, self._len, self._raw_size, self._maximum_size, self._search_prefix, items_str, ) def iteritems(self, store, key_filter=None): """Iterate over items in the node. :param key_filter: A filter to apply to the node. It should be a list/set/dict or similar repeatedly iterable container. """ raise NotImplementedError(self.iteritems) def unmap(self, store, key): """Unmap key from the node.""" raise NotImplementedError(self.unmap) def map(self, store, key: Key, value): """Map key to value.""" raise NotImplementedError(self.map) def key(self) -> Key: """Return the key for this node.""" return self._key def __len__(self) -> int: """Return the number of items in this node.""" return self._len @property def maximum_size(self) -> int: """What is the upper limit for adding references to a node.""" return self._maximum_size def set_maximum_size(self, new_size): """Set the size threshold for nodes. :param new_size: The size at which no data is added to a node. 0 for unlimited. """ self._maximum_size = new_size # Singleton indicating we have not computed _search_prefix yet _unknown = object() class LeafNode(Node): """A node containing actual key:value pairs. :ivar _items: A dict of key->value items. The key is in tuple form. :ivar _size: The number of bytes that would be used by serializing all of the key/value pairs. 
""" __slots__ = ("_common_serialised_prefix",) def __init__(self, search_key_func=None): """Initialize a LeafNode. Args: search_key_func: Function to generate search keys from regular keys. """ Node.__init__(self) # All of the keys in this leaf node share this common prefix self._common_serialised_prefix = None if search_key_func is None: self._search_key_func = _search_key_plain else: self._search_key_func = search_key_func def __repr__(self): """Return string representation of the leaf node.""" items_str = str(sorted(self._items)) if len(items_str) > 20: items_str = items_str[:16] + "...]" return "{}(key:{} len:{} size:{} max:{} prefix:{} keywidth:{} items:{})".format( self.__class__.__name__, self._key, self._len, self._raw_size, self._maximum_size, self._search_prefix, self._key_width, items_str, ) def _current_size(self): """Answer the current serialised size of this node. This differs from self._raw_size in that it includes the bytes used for the header. """ if self._common_serialised_prefix is None: bytes_for_items = 0 prefix_len = 0 else: # We will store a single string with the common prefix # And then that common prefix will not be stored in any of the # entry lines prefix_len = len(self._common_serialised_prefix) bytes_for_items = self._raw_size - (prefix_len * self._len) return ( 9 # 'chkleaf:\n' + + len(str(self._maximum_size)) + 1 + len(str(self._key_width)) + 1 + len(str(self._len)) + 1 + prefix_len + 1 + bytes_for_items ) @classmethod def deserialise(cls, bytes, key, search_key_func=None): """Deserialise bytes, with key key, into a LeafNode. :param bytes: The bytes of the node. :param key: The key that the serialised node has. """ return _deserialise_leaf_node(bytes, key, search_key_func=search_key_func) def iteritems(self, store, key_filter=None): """Iterate over items in the node. :param key_filter: A filter to apply to the node. It should be a list/set/dict or similar repeatedly iterable container. 
""" if key_filter is not None: # Adjust the filter - short elements go to a prefix filter. All # other items are looked up directly. # XXX: perhaps defaultdict? Profiling filters = {} for key in key_filter: if len(key) == self._key_width: # This filter is meant to match exactly one key, yield it # if we have it. try: yield key, self._items[key] except KeyError: # This key is not present in this map, continue pass else: # Short items, we need to match based on a prefix filters.setdefault(len(key), set()).add(key) if filters: filters_itemview = filters.items() for item in self._items.items(): for length, length_filter in filters_itemview: if item[0][:length] in length_filter: yield item break else: yield from self._items.items() def _key_value_len(self, key, value): # TODO: Should probably be done without actually joining the key, but # then that can be done via the C extension return ( len(self._serialise_key(key)) + 1 + len(b"%d" % value.count(b"\n")) + 1 + len(value) + 1 ) def _search_key(self, key: Key) -> bytes: return self._search_key_func(key) def _map_no_split(self, key: Key, value): """Map a key to a value. This assumes either the key does not already exist, or you have already removed its size and length from self. :return: True if adding this node should cause us to split. 
""" self._items[key] = value self._raw_size += self._key_value_len(key, value) self._len += 1 serialised_key = self._serialise_key(key) if self._common_serialised_prefix is None: self._common_serialised_prefix = serialised_key else: self._common_serialised_prefix = common_prefix_pair( self._common_serialised_prefix, serialised_key ) search_key = self._search_key(key) if self._search_prefix is _unknown: self._compute_search_prefix() if self._search_prefix is None: self._search_prefix = search_key else: self._search_prefix = common_prefix_pair(self._search_prefix, search_key) if ( self._len > 1 and self._maximum_size and self._current_size() > self._maximum_size ): # Check to see if all of the search_keys for this node are # identical. We allow the node to grow under that circumstance # (we could track this as common state, but it is infrequent) if ( search_key != self._search_prefix or not self._are_search_keys_identical() ): return True return False def _split(self, store): """We have overflowed. Split this node into multiple LeafNodes, return it up the stack so that the next layer creates a new InternalNode and references the new nodes. :return: (common_serialised_prefix, [(node_serialised_prefix, node)]) """ if self._search_prefix is _unknown: raise AssertionError("Search prefix must be known") common_prefix = self._search_prefix split_at = len(common_prefix) + 1 result = {} for key, value in self._items.items(): search_key = self._search_key(key) prefix = search_key[:split_at] # TODO: Generally only 1 key can be exactly the right length, # which means we can only have 1 key in the node pointed # at by the 'prefix\0' key. We might want to consider # folding it into the containing InternalNode rather than # having a fixed length-1 node. # Note this is probably not true for hash keys, as they # may get a '\00' node anywhere, but won't have keys of # different lengths. 
if len(prefix) < split_at: prefix += b"\x00" * (split_at - len(prefix)) if prefix not in result: node = LeafNode(search_key_func=self._search_key_func) node.set_maximum_size(self._maximum_size) node._key_width = self._key_width result[prefix] = node else: node = result[prefix] sub_prefix, node_details = node.map(store, key, value) if len(node_details) > 1: if prefix != sub_prefix: # This node has been split and is now found via a different # path result.pop(prefix) new_node = InternalNode( sub_prefix, search_key_func=self._search_key_func ) new_node.set_maximum_size(self._maximum_size) new_node._key_width = self._key_width for split, node in node_details: new_node.add_node(split, node) result[prefix] = new_node return common_prefix, list(result.items()) def map(self, store, key: Key, value): """Map key to value.""" if key in self._items: self._raw_size -= self._key_value_len(key, self._items[key]) self._len -= 1 self._key = None if self._map_no_split(key, value): return self._split(store) else: if self._search_prefix is _unknown: raise AssertionError(f"{self._search_prefix!r} must be known") return self._search_prefix, [(b"", self)] @staticmethod def _serialise_key(key): return b"\x00".join(key) def serialise(self, store): """Serialise the LeafNode to store. :param store: A VersionedFiles honouring the CHK extensions. :return: An iterable of the keys inserted by this operation. 
""" lines = [b"chkleaf:\n"] lines.append(b"%d\n" % self._maximum_size) lines.append(b"%d\n" % self._key_width) lines.append(b"%d\n" % self._len) if self._common_serialised_prefix is None: lines.append(b"\n") if len(self._items) != 0: raise AssertionError( "If _common_serialised_prefix is None we should have no items" ) else: lines.append(b"%s\n" % (self._common_serialised_prefix,)) prefix_len = len(self._common_serialised_prefix) for key, value in sorted(self._items.items()): # Always add a final newline value_lines = osutils.chunks_to_lines([value + b"\n"]) serialized = b"%s\x00%d\n" % (self._serialise_key(key), len(value_lines)) if not serialized.startswith(self._common_serialised_prefix): raise AssertionError( f"We thought the common prefix was {self._common_serialised_prefix!r}" f" but entry {serialized!r} does not have it in common" ) lines.append(serialized[prefix_len:]) lines.extend(value_lines) sha1, _, _ = store.add_lines((None,), (), lines) self._key = (b"sha1:" + sha1,) data = b"".join(lines) if len(data) != self._current_size(): raise AssertionError("Invalid _current_size") _get_cache()[self._key] = data return [self._key] def refs(self): """Return the references to other CHK's held by this node.""" return [] def _compute_search_prefix(self): """Determine the common search prefix for all keys in this node. :return: A bytestring of the longest search key prefix that is unique within this node. """ search_keys = [self._search_key_func(key) for key in self._items] self._search_prefix = common_prefix_many(search_keys) return self._search_prefix def _are_search_keys_identical(self): """Check to see if the search keys for all entries are the same. When using a hash as the search_key it is possible for non-identical keys to collide. If that happens enough, we may try overflow a LeafNode, but as all are collisions, we must not split. 
""" common_search_key = None for key in self._items: search_key = self._search_key(key) if common_search_key is None: common_search_key = search_key elif search_key != common_search_key: return False return True def _compute_serialised_prefix(self): """Determine the common prefix for serialised keys in this node. :return: A bytestring of the longest serialised key prefix that is unique within this node. """ serialised_keys = [self._serialise_key(key) for key in self._items] self._common_serialised_prefix = common_prefix_many(serialised_keys) return self._common_serialised_prefix def unmap(self, store, key): """Unmap key from the node.""" try: self._raw_size -= self._key_value_len(key, self._items[key]) except KeyError: logger.debug("key %s not found in %r", key, self._items) raise self._len -= 1 del self._items[key] self._key = None # Recompute from scratch self._compute_search_prefix() self._compute_serialised_prefix() return self class InternalNode(Node): """A node that contains references to other nodes. An InternalNode is responsible for mapping search key prefixes to child nodes. :ivar _items: serialised_key => node dictionary. node may be a tuple, LeafNode or InternalNode. """ __slots__ = ("_node_width",) def __init__(self, prefix=b"", search_key_func=None): """Initialize an InternalNode. Args: prefix: The search key prefix for this node. search_key_func: Function to generate search keys from regular keys. """ Node.__init__(self) # The size of an internalnode with default values and no children. # How many octets key prefixes within this node are. self._node_width = 0 self._search_prefix = prefix if search_key_func is None: self._search_key_func = _search_key_plain else: self._search_key_func = search_key_func def add_node(self, prefix, node: "Node") -> None: """Add a child node with prefix prefix, and node node. :param prefix: The search key prefix for node. :param node: The node being added. 
""" if self._search_prefix is None: raise AssertionError("_search_prefix should not be None") if not isinstance(node, (tuple, Node)): raise AssertionError("Invalid node type: {!r}".format(type(node))) if not prefix.startswith(self._search_prefix): raise AssertionError( f"prefixes mismatch: {prefix} must start with {self._search_prefix}" ) if len(prefix) != len(self._search_prefix) + 1: raise AssertionError( "prefix wrong length: len(%s) is not %d" % (prefix, len(self._search_prefix) + 1) ) self._len += len(node) if not len(self._items): self._node_width = len(prefix) if self._node_width != len(self._search_prefix) + 1: raise AssertionError( "node width mismatch: %d is not %d" % (self._node_width, len(self._search_prefix) + 1) ) self._items[prefix] = node self._key = None def _current_size(self): """Answer the current serialised size of this node.""" return ( self._raw_size + len(str(self._len)) + len(str(self._key_width)) + len(str(self._maximum_size)) ) @classmethod def deserialise(cls, bytes, key, search_key_func: SearchKeyFunc | None = None): """Deserialise bytes to an InternalNode, with key key. :param bytes: The bytes of the node. :param key: The key that the serialised node has. :return: An InternalNode instance. """ return _deserialise_internal_node(bytes, key, search_key_func=search_key_func) def iteritems( self, store, key_filter: list[Key] | None = None ) -> Generator[tuple[Key, bytes]]: """Iterate over items in this node and its children. Args: store: CHK store to retrieve child nodes from. key_filter: Optional list of keys to filter items. Yields: Tuples of (key, value) for items in this subtree. """ for node, node_filter in self._iter_nodes(store, key_filter=key_filter): yield from node.iteritems(store, key_filter=node_filter) def _iter_nodes( self, store, key_filter: KeyFilter | None = None, batch_size: int | None = None, ) -> Generator[tuple[Node, list[Key] | None]]: """Iterate over node objects which match key_filter. 
:param store: A store to use for accessing content. :param key_filter: A key filter to filter nodes. Only nodes that might contain a key in key_filter will be returned. :param batch_size: If not None, then we will return the nodes that had to be read using get_record_stream in batches, rather than reading them all at once. :return: An iterable of nodes. This function does not have to be fully consumed. (There will be no pending I/O when items are being returned.) """ # Map from chk key ('sha1:...',) to (prefix, key_filter) # prefix is the key in self._items to use, key_filter is the key_filter # entries that would match this node keys: dict[Key, tuple[SerialisedKey, list[Key] | None]] = {} shortcut = False if key_filter is None: # yielding all nodes, yield whatever we have, and queue up a read # for whatever we are missing shortcut = True for prefix, node in self._items.items(): if isinstance(node, tuple): keys[node] = (prefix, None) elif isinstance(node, Node): yield node, None else: raise AssertionError("Invalid node type: {!r}".format(type(node))) elif len(key_filter) == 1: # Technically, this path could also be handled by the first check # in 'self._node_width' in length_filters. However, we can handle # this case without spending any time building up the # prefix_to_keys, etc state. 
# This is a bit ugly, but TIMEIT showed it to be by far the fastest # 0.626us list(key_filter)[0] # is a func() for list(), 2 mallocs, and a getitem # 0.489us [k for k in key_filter][0] # still has the mallocs, avoids the func() call # 0.350us iter(key_filter).next() # has a func() call, and mallocs an iterator # 0.125us for key in key_filter: pass # no func() overhead, might malloc an iterator # 0.105us for key in key_filter: break # no func() overhead, might malloc an iterator, probably # avoids checking an 'else' clause as part of the for for key in key_filter: # noqa: B007 break search_prefix = self._search_prefix_filter(key) if len(search_prefix) == self._node_width: # This item will match exactly, so just do a dict lookup, and # see what we can return shortcut = True try: node = self._items[search_prefix] except KeyError: # A given key can only match 1 child node, if it isn't # there, then we can just return nothing return if isinstance(node, tuple): keys[node] = (search_prefix, [key]) elif isinstance(node, Node): # This is loaded, and the only thing that can match, # return yield node, [key] return else: raise AssertionError("Invalid node type: {!r}".format(type(node))) if not shortcut: # First, convert all keys into a list of search prefixes # Aggregate common prefixes, and track the keys they come from prefix_to_keys: dict[SerialisedKey, list[Key]] = {} length_filters: dict[int, set[SerialisedKey]] = {} node_key_filter: list[Key] | None = None if key_filter is None: raise AssertionError("key_filter must not be None") for key in key_filter: search_prefix = self._search_prefix_filter(key) length_filter = length_filters.setdefault(len(search_prefix), set()) length_filter.add(search_prefix) prefix_to_keys.setdefault(search_prefix, []).append(key) if self._node_width in length_filters and len(length_filters) == 1: # all of the search prefixes match exactly _node_width. 
This # means that everything is an exact match, and we can do a # lookup into self._items, rather than iterating over the items # dict. search_prefixes = length_filters[self._node_width] for search_prefix in search_prefixes: try: node = self._items[search_prefix] except KeyError: # We can ignore this one continue node_key_filter = prefix_to_keys[search_prefix] if isinstance(node, tuple): keys[node] = (search_prefix, node_key_filter) elif isinstance(node, Node): yield node, node_key_filter else: raise AssertionError( "Invalid node type: {!r}".format(type(node)) ) else: # The slow way. We walk every item in self._items, and check to # see if there are any matches length_filters_itemview = length_filters.items() for prefix, node in self._items.items(): node_key_filter = [] for length, length_filter in length_filters_itemview: sub_prefix = prefix[:length] if sub_prefix in length_filter: node_key_filter.extend(prefix_to_keys[sub_prefix]) if node_key_filter: # this key matched something, yield it if isinstance(node, tuple): keys[node] = (prefix, node_key_filter) elif isinstance(node, Node): yield node, node_key_filter else: raise AssertionError( "Invalid node type: {!r}".format(type(node)) ) if keys: # Look in the page cache for some more bytes found_keys = set() for key in keys: try: bytes = _get_cache()[key] except KeyError: continue else: node = _deserialise( bytes, key, search_key_func=self._search_key_func ) prefix, node_key_filter = keys[key] if not isinstance(node, Node): raise AssertionError( "Invalid node type: {!r}".format(type(node)) ) self._items[prefix] = node found_keys.add(key) yield node, node_key_filter for key in found_keys: del keys[key] if keys: # demand load some pages. 
if batch_size is None: # Read all the keys in batch_size = len(keys) key_order = list(keys) for batch_start in range(0, len(key_order), batch_size): batch = key_order[batch_start : batch_start + batch_size] # We have to fully consume the stream so there is no pending # I/O, so we buffer the nodes for now. stream = store.get_record_stream(batch, "unordered", True) node_and_filters = [] for record in stream: bytes = record.get_bytes_as("fulltext") node = _deserialise( bytes, record.key, search_key_func=self._search_key_func ) prefix, node_key_filter = keys[record.key] node_and_filters.append((node, node_key_filter)) if not isinstance(node, Node): raise AssertionError( "Invalid node type: {!r}".format(type(node)) ) self._items[prefix] = node _get_cache()[record.key] = bytes yield from node_and_filters def map(self, store, key, value): """Map key to value.""" if not len(self._items): raise AssertionError("can't map in an empty InternalNode.") search_key = self._search_key(key) if self._node_width != len(self._search_prefix) + 1: raise AssertionError( "node width mismatch: %d is not %d" % (self._node_width, len(self._search_prefix) + 1) ) if not search_key.startswith(self._search_prefix): # This key doesn't fit in this index, so we need to split at the # point where it would fit, insert self into that internal node, # and then map this key into that node. 
new_prefix = common_prefix_pair(self._search_prefix, search_key) new_parent = InternalNode(new_prefix, search_key_func=self._search_key_func) new_parent.set_maximum_size(self._maximum_size) new_parent._key_width = self._key_width new_parent.add_node(self._search_prefix[: len(new_prefix) + 1], self) return new_parent.map(store, key, value) children = [node for node, _ in self._iter_nodes(store, key_filter=[key])] if children: child = children[0] else: # new child needed: child = self._new_child(search_key, LeafNode) old_len = len(child) old_size = child._current_size() if isinstance(child, LeafNode) else None prefix, node_details = child.map(store, key, value) if len(node_details) == 1: # child may have shrunk, or might be a new node child = node_details[0][1] self._len = self._len - old_len + len(child) self._items[search_key] = child self._key = None new_node = self if isinstance(child, LeafNode): if old_size is None: # The old node was an InternalNode which means it has now # collapsed, so we need to check if it will chain to a # collapse at this level. logger.debug("checking remap as InternalNode -> LeafNode") new_node = self._check_remap(store) else: # If the LeafNode has shrunk in size, we may want to run # a remap check. Checking for a remap is expensive though # and the frequency of a successful remap is very low. # Shrinkage by small amounts is common, so we only do the # remap check if the new_size is low or the shrinkage # amount is over a configurable limit. new_size = child._current_size() shrinkage = old_size - new_size if ( shrinkage > 0 and new_size < _INTERESTING_NEW_SIZE ) or shrinkage > _INTERESTING_SHRINKAGE_LIMIT: logger.debug( "checking remap as size shrunk by %d to be %d", shrinkage, new_size, ) new_node = self._check_remap(store) if new_node._search_prefix is None: raise AssertionError("_search_prefix should not be None") return new_node._search_prefix, [(b"", new_node)] # child has overflown - create a new intermediate node. 
# XXX: This is where we might want to try and expand our depth # to refer to more bytes of every child (which would give us # multiple pointers to child nodes, but less intermediate nodes) child = self._new_child(search_key, InternalNode) child._search_prefix = prefix for split, node in node_details: child.add_node(split, node) self._len = self._len - old_len + len(child) self._key = None return self._search_prefix, [(b"", self)] def _new_child(self, search_key, klass): """Create a new child node of type klass.""" child = klass() child.set_maximum_size(self._maximum_size) child._key_width = self._key_width child._search_key_func = self._search_key_func self._items[search_key] = child return child def serialise(self, store): """Serialise the node to store. :param store: A VersionedFiles honouring the CHK extensions. :return: An iterable of the keys inserted by this operation. """ for node in self._items.values(): if isinstance(node, tuple): # Never deserialised. continue elif isinstance(node, Node): if node._key is not None: # Never altered continue for key in node.serialise(store): yield key else: raise AssertionError( f"InternalNode._items should only contain tuples or Nodes, not {node.__class__}" ) lines = [b"chknode:\n"] lines.append(b"%d\n" % self._maximum_size) lines.append(b"%d\n" % self._key_width) lines.append(b"%d\n" % self._len) if self._search_prefix is None: raise AssertionError("_search_prefix should not be None") lines.append(b"%s\n" % (self._search_prefix,)) prefix_len = len(self._search_prefix) for prefix, node in sorted(self._items.items()): key = node[0] if isinstance(node, tuple) else node._key[0] serialised = b"%s\x00%s\n" % (prefix, key) if not serialised.startswith(self._search_prefix): raise AssertionError( f"prefixes mismatch: {serialised} must start with {self._search_prefix}" ) lines.append(serialised[prefix_len:]) sha1, _, _ = store.add_lines((None,), (), lines) self._key = (b"sha1:" + sha1,) _get_cache()[self._key] = b"".join(lines) 
yield self._key def _search_key(self, key: Key) -> SerialisedKey: """Return the serialised key for key in this node.""" # search keys are fixed width. All will be self._node_width wide, so we # pad as necessary. return (self._search_key_func(key) + b"\x00" * self._node_width)[ : self._node_width ] def _search_prefix_filter(self, key: Key) -> SerialisedKey: """Serialise key for use as a prefix filter in iteritems.""" return self._search_key_func(key)[: self._node_width] def _split(self, offset: int) -> Iterator[tuple[SerialisedKey, Node]]: """Split this node into smaller nodes starting at offset. :param offset: The offset to start the new child nodes at. :return: An iterable of (prefix, node) tuples. prefix is a byte prefix for reaching node. """ if offset >= self._node_width: for node in self._items.values(): yield from node._split(offset) def refs(self) -> list[tuple[SerialisedKey, Key]]: """Return the references to other CHK's held by this node.""" if self._key is None: raise AssertionError("unserialised nodes have no refs.") refs = [] for value in self._items.values(): if isinstance(value, tuple): refs.append(value) elif isinstance(value, Node): refs.append(value.key()) else: raise AssertionError( f"InternalNode._items should only contain tuples or Nodes, not {value.__class__}" ) return refs def _compute_search_prefix(self, extra_key=None): """Return the unique key prefix for this node. :return: A bytestring of the longest search key prefix that is unique within this node. 
""" self._search_prefix = common_prefix_many(list(self._items.keys())) return self._search_prefix def unmap(self, store, key: Key, check_remap: bool = True) -> Node: """Remove key from this node and its children.""" if not len(self._items): raise AssertionError("can't unmap in an empty InternalNode.") children = [node for node, _ in self._iter_nodes(store, key_filter=[key])] if children: child = children[0] else: raise KeyError(key) self._len -= 1 unmapped: Node | None unmapped = child.unmap(store, key) if unmapped is None: raise AssertionError("unmap returned None, but we expected a node") self._key = None search_key = self._search_key(key) if len(unmapped) == 0: # All child nodes are gone, remove the child: del self._items[search_key] unmapped = None else: # Stash the returned node self._items[search_key] = unmapped if len(self._items) == 1: # this node is no longer needed: return list(self._items.values())[0] if isinstance(unmapped, InternalNode): return self if check_remap: return self._check_remap(store) else: return self def _check_remap(self, store) -> "Node": """Check if all keys contained by children fit in a single LeafNode. :param store: A store to use for reading more nodes :return: Either self, or a new LeafNode which should replace self. """ # Logic for how we determine when we need to rebuild # 1) Implicitly unmap() is removing a key which means that the child # nodes are going to be shrinking by some extent. # 2) If all children are LeafNodes, it is possible that they could be # combined into a single LeafNode, which can then completely replace # this internal node with a single LeafNode # 3) If *one* child is an InternalNode, we assume it has already done # all the work to determine that its children cannot collapse, and # we can then assume that those nodes *plus* the current nodes don't # have a chance of collapsing either. # So a very cheap check is to just say if 'unmapped' is an # InternalNode, we don't have to check further. 
# TODO: Another alternative is to check the total size of all known # LeafNodes. If there is some formula we can use to determine the # final size without actually having to read in any more # children, it would be nice to have. However, we have to be # careful with stuff like nodes that pull out the common prefix # of each key, as adding a new key can change the common prefix # and cause size changes greater than the length of one key. # So for now, we just add everything to a new Leaf until it # splits, as we know that will give the right answer new_leaf = LeafNode(search_key_func=self._search_key_func) new_leaf.set_maximum_size(self._maximum_size) new_leaf._key_width = self._key_width # A batch_size of 16 was chosen because: # a) In testing, a 4k page held 14 times. So if we have more than 16 # leaf nodes we are unlikely to hold them in a single new leaf # node. This still allows for 1 round trip # b) With 16-way fan out, we can still do a single round trip # c) With 255-way fan out, we don't want to read all 255 and destroy # the page cache, just to determine that we really don't need it. for node, _ in self._iter_nodes(store, batch_size=16): if isinstance(node, InternalNode): # Without looking at any leaf nodes, we are sure return self for key, value in node._items.items(): if new_leaf._map_no_split(key, value): return self logger.debug("remap generated a new LeafNode") return new_leaf def _deserialise(data, key, search_key_func): """Helper for repositorydetails - convert bytes to a node.""" if data.startswith(b"chkleaf:\n"): node = LeafNode.deserialise(data, key, search_key_func=search_key_func) elif data.startswith(b"chknode:\n"): node = InternalNode.deserialise(data, key, search_key_func=search_key_func) else: raise AssertionError("Unknown node type.") return node class CHKMapDifference: """Iterate the stored pages and key,value pairs for (new - old). 
This class provides a generator over the stored CHK pages and the (key, value) pairs that are in any of the new maps and not in any of the old maps. Note that it may yield chk pages that are common (especially root nodes), but it won't yield (key,value) pairs that are common. """ def __init__(self, store, new_root_keys, old_root_keys, search_key_func, pb=None): """Initialize CHKMapDifference. Args: store: CHK store to retrieve nodes from. new_root_keys: Keys of the new root nodes. old_root_keys: Keys of the old root nodes. search_key_func: Function to generate search keys. pb: Optional progress bar. """ # TODO: Should we add a StaticTuple barrier here? It would be nice to # force callers to use StaticTuple, because there will often be # lots of keys passed in here. And even if we cast it locally, # that just means that we will have *both* a StaticTuple and a # tuple() in memory, referring to the same object. (so a net # increase in memory, not a decrease.) self._store = store self._new_root_keys = new_root_keys self._old_root_keys = old_root_keys self._pb = pb # All uninteresting chks that we have seen. By the time they are added # here, they should be either fully ignored, or queued up for # processing # TODO: This might grow to a large size if there are lots of merge # parents, etc. However, it probably doesn't scale to O(history) # like _processed_new_refs does. self._all_old_chks = set(self._old_root_keys) # All items that we have seen from the old_root_keys self._all_old_items = set() # These are interesting items which were either read, or already in the # interesting queue (so we don't need to walk them again) # TODO: processed_new_refs becomes O(all_chks), consider switching to # SimpleSet here.
self._processed_new_refs = set() self._search_key_func = search_key_func # The uninteresting and interesting nodes to be searched self._old_queue = [] self._new_queue = [] # Holds the (key, value) items found when processing the root nodes, # waiting for the uninteresting nodes to be walked self._new_item_queue = [] self._state = None def _read_nodes_from_store(self, keys): # We chose not to use _get_cache(), because we think in # terms of records to be yielded. Also, we expect to touch each page # only 1 time during this code. (We may want to evaluate saving the # raw bytes into the page cache, which would allow a working tree # update after the fetch to not have to read the bytes again.) stream = self._store.get_record_stream(keys, "unordered", True) for record in stream: if self._pb is not None: self._pb.tick() if record.storage_kind == "absent": raise NoSuchRevision(self._store, record.key) bytes = record.get_bytes_as("fulltext") node = _deserialise( bytes, record.key, search_key_func=self._search_key_func ) if isinstance(node, InternalNode): # Note we don't have to do node.refs() because we know that # there are no children that have been pushed into this node # Note: Using as_st() here seemed to save 1.2MB, which would # indicate that we keep 100k prefix_refs around while # processing. They *should* be shorter lived than that... # It does cost us ~10s of processing time prefix_refs = list(node._items.items()) items = [] else: prefix_refs = [] # Note: We don't use a StaticTuple here. 
Profiling showed a # minor memory improvement (0.8MB out of 335MB peak 0.2%) # But a significant slowdown (15s / 145s, or 10%) items = list(node._items.items()) yield record, node, prefix_refs, items def _read_old_roots(self): old_chks_to_enqueue = [] all_old_chks = self._all_old_chks for _record, _node, prefix_refs, items in self._read_nodes_from_store( self._old_root_keys ): # Uninteresting node prefix_refs = [p_r for p_r in prefix_refs if p_r[1] not in all_old_chks] new_refs = [p_r[1] for p_r in prefix_refs] all_old_chks.update(new_refs) # TODO: This might be a good time to turn items into StaticTuple # instances and possibly intern them. However, this does not # impact 'initial branch' performance, so I'm not worrying # about this yet self._all_old_items.update(items) # Queue up the uninteresting references # Don't actually put them in the 'to-read' queue until we have # finished checking the interesting references old_chks_to_enqueue.extend(prefix_refs) return old_chks_to_enqueue def _enqueue_old(self, new_prefixes, old_chks_to_enqueue): # At this point, we have read all the uninteresting and interesting # items, so we can queue up the uninteresting stuff, knowing that we've # handled the interesting ones for prefix, ref in old_chks_to_enqueue: not_interesting = True for i in range(len(prefix), 0, -1): if prefix[:i] in new_prefixes: not_interesting = False break if not_interesting: # This prefix is not part of the remaining 'interesting set' continue self._old_queue.append(ref) def _read_all_roots(self): """Read the root pages. This is structured as a generator, so that the root records can be yielded up to whoever needs them without any buffering. 
""" # This is the bootstrap phase if not self._old_root_keys: # With no old_root_keys we can just shortcut and be ready # for _flush_new_queue self._new_queue = list(self._new_root_keys) return old_chks_to_enqueue = self._read_old_roots() # filter out any root keys that are already known to be uninteresting new_keys = set(self._new_root_keys).difference(self._all_old_chks) # These are prefixes that are present in new_keys that we are # thinking to yield new_prefixes = set() # We are about to yield all of these, so we don't want them getting # added a second time processed_new_refs = self._processed_new_refs processed_new_refs.update(new_keys) for record, _node, prefix_refs, items in self._read_nodes_from_store(new_keys): # At this level, we now know all the uninteresting references # So we filter and queue up whatever is remaining prefix_refs = [ p_r for p_r in prefix_refs if p_r[1] not in self._all_old_chks and p_r[1] not in processed_new_refs ] refs = [p_r[1] for p_r in prefix_refs] new_prefixes.update([p_r[0] for p_r in prefix_refs]) self._new_queue.extend(refs) # TODO: We can potentially get multiple items here, however the # current design allows for this, as callers will do the work # to make the results unique. We might profile whether we # gain anything by ensuring unique return values for items # TODO: This might be a good time to cast to StaticTuple, as # self._new_item_queue will hold the contents of multiple # records for an extended lifetime new_items = [item for item in items if item not in self._all_old_items] self._new_item_queue.extend(new_items) new_prefixes.update([self._search_key_func(item[0]) for item in new_items]) processed_new_refs.update(refs) yield record # For new_prefixes we have the full length prefixes queued up. # However, we also need possible prefixes. (If we have a known ref to # 'ab', then we also need to include 'a'.) 
So expand the # new_prefixes to include all shorter prefixes for prefix in list(new_prefixes): new_prefixes.update([prefix[:i] for i in range(1, len(prefix))]) self._enqueue_old(new_prefixes, old_chks_to_enqueue) def _flush_new_queue(self): # No need to maintain the heap invariant anymore, just pull things out # and process them refs = set(self._new_queue) self._new_queue = [] # First pass, flush all interesting items and convert to using direct refs all_old_chks = self._all_old_chks processed_new_refs = self._processed_new_refs all_old_items = self._all_old_items new_items = [item for item in self._new_item_queue if item not in all_old_items] self._new_item_queue = [] if new_items: yield None, new_items refs = refs.difference(all_old_chks) processed_new_refs.update(refs) while refs: # TODO: Using a SimpleSet for self._processed_new_refs and # saved as much as 10MB of peak memory. However, it requires # implementing a non-pyrex version. next_refs = set() next_refs_update = next_refs.update # Inlining _read_nodes_from_store improves 'bzr branch bzr.dev' # from 1m54s to 1m51s. Consider it. for record, _, p_refs, items in self._read_nodes_from_store(refs): if all_old_items: # using the 'if' check saves about 145s => 141s, when # streaming initial branch of Launchpad data. items = [item for item in items if item not in all_old_items] yield record, items next_refs_update([p_r[1] for p_r in p_refs]) del p_refs # set1.difference(set/dict) walks all of set1, and checks if it # exists in 'other'. # set1.difference(iterable) walks all of iterable, and does a # 'difference_update' on a clone of set1. Pick wisely based on the # expected sizes of objects. # in our case it is expected that 'new_refs' will always be quite # small. 
next_refs = next_refs.difference(all_old_chks) next_refs = next_refs.difference(processed_new_refs) processed_new_refs.update(next_refs) refs = next_refs def _process_next_old(self): # Since we don't filter uninteresting any further than during # _read_all_roots, process the whole queue in a single pass. refs = self._old_queue self._old_queue = [] all_old_chks = self._all_old_chks for _record, _, prefix_refs, items in self._read_nodes_from_store(refs): # TODO: Use StaticTuple here? self._all_old_items.update(items) refs = [r for _, r in prefix_refs if r not in all_old_chks] self._old_queue.extend(refs) all_old_chks.update(refs) def _process_queues(self): while self._old_queue: self._process_next_old() return self._flush_new_queue() def process(self): """Process the difference between old and new CHK maps. Yields: Tuples of (record, items) for pages and key-value pairs that are in the new maps but not in the old maps. """ for record in self._read_all_roots(): yield record, [] for record, items in self._process_queues(): yield record, items def iter_interesting_nodes( store, interesting_root_keys, uninteresting_root_keys, pb=None ): """Given root keys, find interesting nodes. Evaluate nodes referenced by interesting_root_keys. Ones that are also referenced from uninteresting_root_keys are not considered interesting. :param interesting_root_keys: keys which should be part of the "interesting" nodes (which will be yielded) :param uninteresting_root_keys: keys which should be filtered out of the result set. 
:return: Yield (interesting record, {interesting key:values}) """ iterator = CHKMapDifference( store, interesting_root_keys, uninteresting_root_keys, search_key_func=store._search_key_func, pb=pb, ) return iterator.process() from ._bzr_rs import chk_map as _chk_map_rs _bytes_to_text_key = _chk_map_rs._bytes_to_text_key _search_key_16 = _chk_map_rs._search_key_16 _search_key_255 = _chk_map_rs._search_key_255 search_key_registry.register(b"hash-16-way", _search_key_16) search_key_registry.register(b"hash-255-way", _search_key_255) def _check_key(key): """Helper function to assert that a key is properly formatted. This generally shouldn't be used in production code, but it can be helpful to debug problems. """ if not isinstance(key, tuple): raise TypeError(f"key {key!r} is not tuple but {type(key)}") if len(key) != 1: raise ValueError(f"key {key!r} should have length 1, not {len(key)}") if not isinstance(key[0], str): raise TypeError(f"key {key!r} should hold a str, not {type(key[0])!r}") if not key[0].startswith("sha1:"): raise ValueError(f"key {key!r} should point to a sha1:") bzrformats_3.4.0.orig/bzrformats/chk_serializer.py0000644000000000000000000001267115162073400017424 0ustar00# Copyright (C) 2008, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Serializer object for CHK based inventory storage.""" from . import serializer class CHKSerializer(serializer.InventorySerializer): """A CHKInventory based serializer with 'plain' behaviour.""" support_altered_by_hack = False supported_kinds = {"file", "directory", "symlink", "tree-reference"} def __init__(self, format_num, node_size, search_key_name): """Initialize a CHKSerializer instance. Args: format_num: The format number for the serializer (e.g., b"9" or b"10"). node_size: The maximum size for CHK nodes (typically 65536). search_key_name: The name of the search key algorithm (e.g., b"hash-255-way"). """ self.format_num = format_num self.maximum_size = node_size self.search_key_name = search_key_name def _unpack_inventory( self, elt, revision_id=None, entry_cache=None, return_from_cache=False ): """Construct from XML Element.""" from .xml_serializer import unpack_inventory_entry, unpack_inventory_flat inv = unpack_inventory_flat( elt, self.format_num, unpack_inventory_entry, entry_cache, return_from_cache ) return inv def read_inventory_from_lines( self, xml_lines, revision_id=None, entry_cache=None, return_from_cache=False ): """Read xml_lines into an inventory object. :param xml_lines: The lines of XML to read. :param revision_id: If not-None, the expected revision id of the inventory. :param entry_cache: An optional cache of InventoryEntry objects. If supplied we will look up entries via (file_id, revision_id) which should map to a valid InventoryEntry (File/Directory/etc) object. :param return_from_cache: Return entries directly from the cache, rather than copying them first. This is only safe if the caller promises not to mutate the returned inventory entries, but it can make some operations significantly faster.
""" from .xml_serializer import ParseError, fromstringlist try: return self._unpack_inventory( fromstringlist(xml_lines), revision_id, entry_cache=entry_cache, return_from_cache=return_from_cache, ) except ParseError as e: raise serializer.UnexpectedInventoryFormat(e) from e def read_inventory(self, f, revision_id=None): """Read an inventory from a file-like object.""" from .xml_serializer import ParseError try: try: return self._unpack_inventory(self._read_element(f), revision_id=None) finally: f.close() except ParseError as e: raise serializer.UnexpectedInventoryFormat(e) from e def write_inventory_to_lines(self, inv): """Return a list of lines with the encoded inventory.""" return self.write_inventory(inv, None) def write_inventory_to_chunks(self, inv): """Return a list of lines with the encoded inventory.""" return self.write_inventory(inv, None) def write_inventory(self, inv, f, working=False): """Write inventory to a file. :param inv: the inventory to write. :param f: the file to write. (May be None if the lines are the desired output). :param working: If True skip history data - text_sha1, text_size, reference_revision, symlink_target. :return: The inventory as a list of lines. """ from .xml_serializer import encode_and_escape, serialize_inventory_flat output = [] append = output.append if inv.revision_id is not None: revid = b"".join( [b' revision_id="', encode_and_escape(inv.revision_id), b'"'] ) else: revid = b"" append(b'<inventory format="%s"%s>\n' % (self.format_num, revid)) append( b'<directory file_id="%s" name="%s" revision="%s" />\n' % ( encode_and_escape(inv.root.file_id), encode_and_escape(inv.root.name), encode_and_escape(inv.root.revision), ) ) serialize_inventory_flat( inv, append, root_id=None, supported_kinds=self.supported_kinds, working=working, ) if f is not None: f.writelines(output) return output # A CHKInventory based serializer with 'plain' behaviour.
inventory_chk_serializer_255_bigpage_9 = CHKSerializer(b"9", 65536, b"hash-255-way") inventory_chk_serializer_255_bigpage_10 = CHKSerializer(b"10", 65536, b"hash-255-way") bzrformats_3.4.0.orig/bzrformats/chunk_writer.py0000644000000000000000000002731515162073400017133 0ustar00# Copyright (C) 2008 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """ChunkWriter: write compressed data out with a fixed upper bound.""" import zlib from zlib import Z_FINISH, Z_SYNC_FLUSH class ChunkWriter: """ChunkWriter allows writing of compressed data with a fixed size. If less data is supplied than fills a chunk, the chunk is padded with NULL bytes. If more data is supplied, then the writer packs as much in as it can, but never splits any item it was given. The algorithm for packing is open to improvement! Currently it is: - write the bytes given - if the total seen bytes so far exceeds the chunk size, flush. :cvar _max_repack: To fit the maximum number of entries into a node, we will sometimes start over and compress the whole list to get tighter packing. We get diminishing returns after a while, so this limits the number of times we will try. The default is to try to avoid recompressing entirely, but setting this to something like 20 will give maximum compression. :cvar _max_zsync: Another tunable knob.
If _max_repack is set to 0, then you can limit the number of times we will try to pack more data into a node. This allows us to do a single compression pass, rather than trying until we overflow, and then recompressing again. """ # In testing, some values for bzr.dev:: # repack time MB max full # 1 7.5 4.6 1140 0 # 2 8.4 4.2 1036 1 # 3 9.8 4.1 1012 278 # 4 10.8 4.1 728 945 # 20 11.1 4.1 0 1012 # repack = 0 # zsync time MB repack stop_for_z # 0 5.0 24.7 0 6270 # 1 4.3 13.2 0 3342 # 2 4.9 9.6 0 2414 # 5 4.8 6.2 0 1549 # 6 4.8 5.8 1 1435 # 7 4.8 5.5 19 1337 # 8 4.4 5.3 81 1220 # 10 5.3 5.0 260 967 # 11 5.3 4.9 366 839 # 12 5.1 4.8 454 731 # 15 5.8 4.7 704 450 # 20 5.8 4.6 1133 7 # In testing, some values for mysql-unpacked:: # next_bytes estim # repack time MB full stop_for_repack # 1 15.4 0 3913 # 2 35.4 13.7 0 346 # 20 46.7 13.4 3380 0 # repack=0 # zsync stop_for_z # 0 29.5 116.5 0 29782 # 1 27.8 60.2 0 15356 # 2 27.8 42.4 0 10822 # 5 26.8 25.5 0 6491 # 6 27.3 23.2 13 5896 # 7 27.5 21.6 29 5451 # 8 27.1 20.3 52 5108 # 10 29.4 18.6 195 4526 # 11 29.2 18.0 421 4143 # 12 28.0 17.5 702 3738 # 15 28.9 16.5 1223 2969 # 20 29.6 15.7 2182 1810 # 30 31.4 15.4 3891 23 # Tuple of (num_repack_attempts, num_zsync_attempts) # num_zsync_attempts only has meaning if num_repack_attempts is 0. _repack_opts_for_speed = (0, 8) _repack_opts_for_size = (20, 0) def __init__( self, chunk_size: int, reserved: int = 0, optimize_for_size: bool = False ) -> None: """Create a ChunkWriter to write chunk_size chunks. :param chunk_size: The total byte count to emit at the end of the chunk. :param reserved: How many bytes to allow for reserved data. reserved data space can only be written to via the write(..., reserved=True). 
""" self.chunk_size = chunk_size self.compressor = zlib.compressobj() self.bytes_in: list[bytes] = [] self.bytes_list: list[bytes] = [] self.bytes_out_len = 0 # bytes that have been seen, but not included in a flush to out yet self.unflushed_in_bytes = 0 self.num_repack = 0 self.num_zsync = 0 self.unused_bytes: bytes | None = None self.reserved_size = reserved # Default is to make building fast rather than compact self.set_optimize(for_size=optimize_for_size) def finish(self) -> tuple[list[bytes], bytes | None, int]: """Finish the chunk. This returns the final compressed chunk, and either None, or the bytes that did not fit in the chunk. :return: (compressed_bytes, unused_bytes, num_nulls_needed) * compressed_bytes: a list of bytes that were output from the compressor. If the compressed length was not exactly chunk_size, the final string will be a string of all null bytes to pad this to chunk_size * unused_bytes: None, or the last bytes that were added, which we could not fit. * num_nulls_needed: How many nulls are padded at the end """ self.bytes_in = [] out = self.compressor.flush(Z_FINISH) self.bytes_list.append(out) self.bytes_out_len += len(out) if self.bytes_out_len > self.chunk_size: raise AssertionError( "Somehow we ended up with too much" " compressed data, %d > %d" % (self.bytes_out_len, self.chunk_size) ) nulls_needed = self.chunk_size - self.bytes_out_len if nulls_needed: self.bytes_list.append(b"\x00" * nulls_needed) return self.bytes_list, self.unused_bytes, nulls_needed def set_optimize(self, for_size: bool = True) -> None: """Change how we optimize our writes. :param for_size: If True, optimize for minimum space usage, otherwise optimize for fastest writing speed. 
:return: None """ if for_size: opts = ChunkWriter._repack_opts_for_size else: opts = ChunkWriter._repack_opts_for_speed self._max_repack, self._max_zsync = opts def _recompress_all_bytes_in( self, extra_bytes: bytes | None = None ) -> tuple[list[bytes], int, "zlib._Compress"]: """Recompress the current bytes_in, and optionally more. :param extra_bytes: Optional, if supplied we will add it with Z_SYNC_FLUSH :return: (bytes_out, bytes_out_len, alt_compressed) * bytes_out: is the compressed bytes returned from the compressor * bytes_out_len: the length of the compressed output * compressor: An object with everything packed in so far, and Z_SYNC_FLUSH called. """ compressor = zlib.compressobj() bytes_out: list[bytes] = [] append = bytes_out.append compress = compressor.compress for accepted_bytes in self.bytes_in: out = compress(accepted_bytes) if out: append(out) if extra_bytes: out = compress(extra_bytes) out += compressor.flush(Z_SYNC_FLUSH) append(out) bytes_out_len = sum(map(len, bytes_out)) return bytes_out, bytes_out_len, compressor def write(self, bytes: bytes, reserved: bool = False) -> bool: """Write some bytes to the chunk. If the bytes fit, False is returned. Otherwise True is returned and the bytes have not been added to the chunk. :param bytes: The bytes to include :param reserved: If True, we can use the space reserved in the constructor. """ if self.num_repack > self._max_repack and not reserved: self.unused_bytes = bytes return True capacity = self.chunk_size if reserved else self.chunk_size - self.reserved_size comp = self.compressor # Check to see if the currently unflushed bytes would fit with a bit of # room to spare, assuming no compression. 
        next_unflushed = self.unflushed_in_bytes + len(bytes)
        remaining_capacity = capacity - self.bytes_out_len - 10
        if next_unflushed < remaining_capacity:
            # looks like it will fit
            out = comp.compress(bytes)
            if out:
                self.bytes_list.append(out)
                self.bytes_out_len += len(out)
            self.bytes_in.append(bytes)
            self.unflushed_in_bytes += len(bytes)
        else:
            # This may or may not fit, try to add it with Z_SYNC_FLUSH
            # Note: It is tempting to do this as a look-ahead pass, and to
            #       'copy()' the compressor before flushing. However, that
            #       appears to amount to the same thing as increasing
            #       repack: similar cost, same benefit. And this way we still
            #       have the 'repack' knob that can be adjusted, and do not
            #       depend on a platform-specific 'copy()' function.
            self.num_zsync += 1
            if self._max_repack == 0 and self.num_zsync > self._max_zsync:
                self.num_repack += 1
                self.unused_bytes = bytes
                return True
            out = comp.compress(bytes)
            out += comp.flush(Z_SYNC_FLUSH)
            self.unflushed_in_bytes = 0
            if out:
                self.bytes_list.append(out)
                self.bytes_out_len += len(out)

            # We are a bit extra conservative, because it seems that you *can*
            # get better compression with Z_SYNC_FLUSH than a full compress. It
            # is probably very rare, but we were able to trigger it.
            safety_margin = 100 if self.num_repack == 0 else 10
            if self.bytes_out_len + safety_margin <= capacity:
                # It fit, so mark it added
                self.bytes_in.append(bytes)
            else:
                # We are over budget, try to squeeze this in without any
                # Z_SYNC_FLUSH calls
                self.num_repack += 1
                (bytes_out, this_len, compressor) = self._recompress_all_bytes_in(bytes)
                if self.num_repack >= self._max_repack:
                    # When we get *to* _max_repack, bump over so that the
                    # earlier > _max_repack will be triggered.
                    self.num_repack += 1
                if this_len + 10 > capacity:
                    (bytes_out, this_len, compressor) = self._recompress_all_bytes_in()
                    self.compressor = compressor
                    # Force us to not allow more data
                    self.num_repack = self._max_repack + 1
                    self.bytes_list = bytes_out
                    self.bytes_out_len = this_len
                    self.unused_bytes = bytes
                    return True
                else:
                    # This fits when we pack it tighter, so use the new packing
                    self.compressor = compressor
                    self.bytes_in.append(bytes)
                    self.bytes_list = bytes_out
                    self.bytes_out_len = this_len
        return False
bzrformats_3.4.0.orig/bzrformats/delta.h0000644000000000000000000001304115162073400015306 0ustar00
/*
 * delta.h: headers for delta functionality
 *
 * Adapted from GIT for Bazaar by
 *   John Arbash Meinel (C) 2009
 *
 * This code is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
#ifndef DELTA_H
#define DELTA_H

/* opaque object for delta index */
struct delta_index;

struct source_info {
    const void *buf;          /* Pointer to the beginning of source data */
    unsigned long size;       /* Total length of source data */
    unsigned long agg_offset; /* Start of source data as part of the
                                 aggregate source */
};

/* result type for functions that have multiple failure modes */
typedef enum {
    DELTA_OK,             /* Success */
    DELTA_OUT_OF_MEMORY,  /* Could not allocate required memory */
    DELTA_INDEX_NEEDED,   /* A delta_index must be passed */
    DELTA_SOURCE_EMPTY,   /* A source_info had no content */
    DELTA_SOURCE_BAD,     /* A source_info had invalid or corrupt content */
    DELTA_BUFFER_EMPTY,   /* A buffer pointer was NULL or its size was zero */
    DELTA_SIZE_TOO_BIG,   /* Delta data is larger than the max requested */
} delta_result;

/*
 * create_delta_index: compute index data from given buffer
 *
 * Returns a delta_result status, when DELTA_OK then *fresh is set to a struct
 * delta_index that should be passed to subsequent create_delta() calls, or to
 * free_delta_index().  Other values are a failure, and *fresh is unset.
* The given buffer must not be freed nor altered before free_delta_index() is * called. The resultant struct must be freed using free_delta_index(). * * :param max_bytes_to_index: Limit the number of regions to sample to this * amount of text. We will store at most max_bytes_to_index / RABIN_WINDOW * pointers into the source text. Useful if src can be unbounded in size, * and you are willing to trade match accuracy for peak memory. */ extern delta_result create_delta_index(const struct source_info *src, struct delta_index *old, struct delta_index **fresh, int max_bytes_to_index); /* * create_delta_index_from_delta: compute index data from given buffer * * Returns a delta_result status, when DELTA_OK then *fresh is set to a struct * delta_index that should be passed to subsequent create_delta() calls, or to * free_delta_index(). Other values are a failure, and *fresh is unset. * The bytes must be in the form of a delta structure, as generated by * create_delta(). The generated index will only index the insert bytes, and * not any of the control structures. */ extern delta_result create_delta_index_from_delta(const struct source_info *delta, struct delta_index *old, struct delta_index **fresh); /* * free_delta_index: free the index created by create_delta_index() * * Given pointer must be what create_delta_index() returned, or NULL. */ extern void free_delta_index(struct delta_index *index); /* * sizeof_delta_index: returns memory usage of delta index * * Given pointer must be what create_delta_index() returned, or NULL. */ extern unsigned long sizeof_delta_index(struct delta_index *index); /* * create_delta: create a delta from given index for the given buffer * * This function may be called multiple times with different buffers using * the same delta_index pointer. If max_delta_size is non-zero and the * resulting delta is to be larger than max_delta_size then DELTA_SIZE_TOO_BIG * is returned. 
 * Otherwise on success, DELTA_OK is returned and *delta_data is
 * set to a new buffer with the delta data and *delta_size is updated with its
 * size.  That buffer must be freed by the caller.
 */
extern delta_result
create_delta(const struct delta_index *index,
             const void *buf, unsigned long bufsize,
             unsigned long *delta_size, unsigned long max_delta_size,
             void **delta_data);

/* the smallest possible delta size is 3 bytes
 * Target size, Copy command, Copy length
 */
#define DELTA_SIZE_MIN 3

/*
 * This must be called twice on the delta data buffer, first to get the
 * expected source buffer size, and again to get the target buffer size.
 */
static unsigned long
get_delta_hdr_size(unsigned char **datap, const unsigned char *top)
{
    unsigned char *data = *datap;
    unsigned char cmd;
    unsigned long size = 0;
    int i = 0;
    do {
        cmd = *data++;
        size |= (cmd & ~0x80) << i;
        i += 7;
    } while (cmd & 0x80 && data < top);
    *datap = data;
    return size;
}

/*
 * Return the basic information about a given delta index.
 * :param index: The delta_index object
 * :param pos: The offset in the entry list. Start at 0, and walk until you get
 *      0 as a return code.
 * :param text_offset: return value, distance to the beginning of all sources
 * :param hash_val: return value, the RABIN hash associated with this pointer
 * :return: 1 if pos != -1 (there was data produced)
 */
extern int
get_entry_summary(const struct delta_index *index, int pos,
                  unsigned int *text_offset, unsigned int *hash_val);

/*
 * Determine what entry index->hash[X] points to.
 */
extern int
get_hash_offset(const struct delta_index *index, int pos,
                unsigned int *entry_offset);

/*
 * Compute the rabin_hash of the given data, it is assumed the data is at least
 * RABIN_WINDOW wide (16 bytes).
 */
extern unsigned int
rabin_hash(const unsigned char *data);

#endif
bzrformats_3.4.0.orig/bzrformats/diff-delta.c0000644000000000000000000012707615162073400016225 0ustar00
/*
 * diff-delta.c: generate a delta between two buffers
 *
 * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
 * http://www.xmailserver.org/xdiff-lib.html
 *
 * Rewritten for GIT by Nicolas Pitre, (C) 2005-2007
 * Adapted for Bazaar by John Arbash Meinel (C) 2009
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * NB: The version in GIT is 'version 2 of the Licence only', however Nicolas
 * has granted permission for use under 'version 2 or later' in private email
 * to Robert Collins and Karl Fogel on the 6th April 2009.
 */

#include <stdio.h>

#include "delta.h"
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* maximum hash entry list for the same hash bucket */
#define HASH_LIMIT 64

#define RABIN_SHIFT 23
#define RABIN_WINDOW 16

/* The hash map is sized to put 4 entries per bucket, this gives us ~even room
 * for more data. Tweaking this number above 4 doesn't seem to help much,
 * anyway.
*/ #define EXTRA_NULLS 4 static const unsigned int T[256] = { 0x00000000, 0xab59b4d1, 0x56b369a2, 0xfdeadd73, 0x063f6795, 0xad66d344, 0x508c0e37, 0xfbd5bae6, 0x0c7ecf2a, 0xa7277bfb, 0x5acda688, 0xf1941259, 0x0a41a8bf, 0xa1181c6e, 0x5cf2c11d, 0xf7ab75cc, 0x18fd9e54, 0xb3a42a85, 0x4e4ef7f6, 0xe5174327, 0x1ec2f9c1, 0xb59b4d10, 0x48719063, 0xe32824b2, 0x1483517e, 0xbfdae5af, 0x423038dc, 0xe9698c0d, 0x12bc36eb, 0xb9e5823a, 0x440f5f49, 0xef56eb98, 0x31fb3ca8, 0x9aa28879, 0x6748550a, 0xcc11e1db, 0x37c45b3d, 0x9c9defec, 0x6177329f, 0xca2e864e, 0x3d85f382, 0x96dc4753, 0x6b369a20, 0xc06f2ef1, 0x3bba9417, 0x90e320c6, 0x6d09fdb5, 0xc6504964, 0x2906a2fc, 0x825f162d, 0x7fb5cb5e, 0xd4ec7f8f, 0x2f39c569, 0x846071b8, 0x798aaccb, 0xd2d3181a, 0x25786dd6, 0x8e21d907, 0x73cb0474, 0xd892b0a5, 0x23470a43, 0x881ebe92, 0x75f463e1, 0xdeadd730, 0x63f67950, 0xc8afcd81, 0x354510f2, 0x9e1ca423, 0x65c91ec5, 0xce90aa14, 0x337a7767, 0x9823c3b6, 0x6f88b67a, 0xc4d102ab, 0x393bdfd8, 0x92626b09, 0x69b7d1ef, 0xc2ee653e, 0x3f04b84d, 0x945d0c9c, 0x7b0be704, 0xd05253d5, 0x2db88ea6, 0x86e13a77, 0x7d348091, 0xd66d3440, 0x2b87e933, 0x80de5de2, 0x7775282e, 0xdc2c9cff, 0x21c6418c, 0x8a9ff55d, 0x714a4fbb, 0xda13fb6a, 0x27f92619, 0x8ca092c8, 0x520d45f8, 0xf954f129, 0x04be2c5a, 0xafe7988b, 0x5432226d, 0xff6b96bc, 0x02814bcf, 0xa9d8ff1e, 0x5e738ad2, 0xf52a3e03, 0x08c0e370, 0xa39957a1, 0x584ced47, 0xf3155996, 0x0eff84e5, 0xa5a63034, 0x4af0dbac, 0xe1a96f7d, 0x1c43b20e, 0xb71a06df, 0x4ccfbc39, 0xe79608e8, 0x1a7cd59b, 0xb125614a, 0x468e1486, 0xedd7a057, 0x103d7d24, 0xbb64c9f5, 0x40b17313, 0xebe8c7c2, 0x16021ab1, 0xbd5bae60, 0x6cb54671, 0xc7ecf2a0, 0x3a062fd3, 0x915f9b02, 0x6a8a21e4, 0xc1d39535, 0x3c394846, 0x9760fc97, 0x60cb895b, 0xcb923d8a, 0x3678e0f9, 0x9d215428, 0x66f4eece, 0xcdad5a1f, 0x3047876c, 0x9b1e33bd, 0x7448d825, 0xdf116cf4, 0x22fbb187, 0x89a20556, 0x7277bfb0, 0xd92e0b61, 0x24c4d612, 0x8f9d62c3, 0x7836170f, 0xd36fa3de, 0x2e857ead, 0x85dcca7c, 0x7e09709a, 0xd550c44b, 0x28ba1938, 0x83e3ade9, 0x5d4e7ad9, 
0xf617ce08, 0x0bfd137b, 0xa0a4a7aa, 0x5b711d4c, 0xf028a99d, 0x0dc274ee, 0xa69bc03f, 0x5130b5f3, 0xfa690122, 0x0783dc51, 0xacda6880, 0x570fd266, 0xfc5666b7, 0x01bcbbc4, 0xaae50f15, 0x45b3e48d, 0xeeea505c, 0x13008d2f, 0xb85939fe, 0x438c8318, 0xe8d537c9, 0x153feaba, 0xbe665e6b, 0x49cd2ba7, 0xe2949f76, 0x1f7e4205, 0xb427f6d4, 0x4ff24c32, 0xe4abf8e3, 0x19412590, 0xb2189141, 0x0f433f21, 0xa41a8bf0, 0x59f05683, 0xf2a9e252, 0x097c58b4, 0xa225ec65, 0x5fcf3116, 0xf49685c7, 0x033df00b, 0xa86444da, 0x558e99a9, 0xfed72d78, 0x0502979e, 0xae5b234f, 0x53b1fe3c, 0xf8e84aed, 0x17bea175, 0xbce715a4, 0x410dc8d7, 0xea547c06, 0x1181c6e0, 0xbad87231, 0x4732af42, 0xec6b1b93, 0x1bc06e5f, 0xb099da8e, 0x4d7307fd, 0xe62ab32c, 0x1dff09ca, 0xb6a6bd1b, 0x4b4c6068, 0xe015d4b9, 0x3eb80389, 0x95e1b758, 0x680b6a2b, 0xc352defa, 0x3887641c, 0x93ded0cd, 0x6e340dbe, 0xc56db96f, 0x32c6cca3, 0x999f7872, 0x6475a501, 0xcf2c11d0, 0x34f9ab36, 0x9fa01fe7, 0x624ac294, 0xc9137645, 0x26459ddd, 0x8d1c290c, 0x70f6f47f, 0xdbaf40ae, 0x207afa48, 0x8b234e99, 0x76c993ea, 0xdd90273b, 0x2a3b52f7, 0x8162e626, 0x7c883b55, 0xd7d18f84, 0x2c043562, 0x875d81b3, 0x7ab75cc0, 0xd1eee811 }; static const unsigned int U[256] = { 0x00000000, 0x7eb5200d, 0x5633f4cb, 0x2886d4c6, 0x073e5d47, 0x798b7d4a, 0x510da98c, 0x2fb88981, 0x0e7cba8e, 0x70c99a83, 0x584f4e45, 0x26fa6e48, 0x0942e7c9, 0x77f7c7c4, 0x5f711302, 0x21c4330f, 0x1cf9751c, 0x624c5511, 0x4aca81d7, 0x347fa1da, 0x1bc7285b, 0x65720856, 0x4df4dc90, 0x3341fc9d, 0x1285cf92, 0x6c30ef9f, 0x44b63b59, 0x3a031b54, 0x15bb92d5, 0x6b0eb2d8, 0x4388661e, 0x3d3d4613, 0x39f2ea38, 0x4747ca35, 0x6fc11ef3, 0x11743efe, 0x3eccb77f, 0x40799772, 0x68ff43b4, 0x164a63b9, 0x378e50b6, 0x493b70bb, 0x61bda47d, 0x1f088470, 0x30b00df1, 0x4e052dfc, 0x6683f93a, 0x1836d937, 0x250b9f24, 0x5bbebf29, 0x73386bef, 0x0d8d4be2, 0x2235c263, 0x5c80e26e, 0x740636a8, 0x0ab316a5, 0x2b7725aa, 0x55c205a7, 0x7d44d161, 0x03f1f16c, 0x2c4978ed, 0x52fc58e0, 0x7a7a8c26, 0x04cfac2b, 0x73e5d470, 0x0d50f47d, 0x25d620bb, 0x5b6300b6, 
0x74db8937, 0x0a6ea93a, 0x22e87dfc, 0x5c5d5df1, 0x7d996efe, 0x032c4ef3, 0x2baa9a35, 0x551fba38, 0x7aa733b9, 0x041213b4, 0x2c94c772, 0x5221e77f, 0x6f1ca16c, 0x11a98161, 0x392f55a7, 0x479a75aa, 0x6822fc2b, 0x1697dc26, 0x3e1108e0, 0x40a428ed, 0x61601be2, 0x1fd53bef, 0x3753ef29, 0x49e6cf24, 0x665e46a5, 0x18eb66a8, 0x306db26e, 0x4ed89263, 0x4a173e48, 0x34a21e45, 0x1c24ca83, 0x6291ea8e, 0x4d29630f, 0x339c4302, 0x1b1a97c4, 0x65afb7c9, 0x446b84c6, 0x3adea4cb, 0x1258700d, 0x6ced5000, 0x4355d981, 0x3de0f98c, 0x15662d4a, 0x6bd30d47, 0x56ee4b54, 0x285b6b59, 0x00ddbf9f, 0x7e689f92, 0x51d01613, 0x2f65361e, 0x07e3e2d8, 0x7956c2d5, 0x5892f1da, 0x2627d1d7, 0x0ea10511, 0x7014251c, 0x5facac9d, 0x21198c90, 0x099f5856, 0x772a785b, 0x4c921c31, 0x32273c3c, 0x1aa1e8fa, 0x6414c8f7, 0x4bac4176, 0x3519617b, 0x1d9fb5bd, 0x632a95b0, 0x42eea6bf, 0x3c5b86b2, 0x14dd5274, 0x6a687279, 0x45d0fbf8, 0x3b65dbf5, 0x13e30f33, 0x6d562f3e, 0x506b692d, 0x2ede4920, 0x06589de6, 0x78edbdeb, 0x5755346a, 0x29e01467, 0x0166c0a1, 0x7fd3e0ac, 0x5e17d3a3, 0x20a2f3ae, 0x08242768, 0x76910765, 0x59298ee4, 0x279caee9, 0x0f1a7a2f, 0x71af5a22, 0x7560f609, 0x0bd5d604, 0x235302c2, 0x5de622cf, 0x725eab4e, 0x0ceb8b43, 0x246d5f85, 0x5ad87f88, 0x7b1c4c87, 0x05a96c8a, 0x2d2fb84c, 0x539a9841, 0x7c2211c0, 0x029731cd, 0x2a11e50b, 0x54a4c506, 0x69998315, 0x172ca318, 0x3faa77de, 0x411f57d3, 0x6ea7de52, 0x1012fe5f, 0x38942a99, 0x46210a94, 0x67e5399b, 0x19501996, 0x31d6cd50, 0x4f63ed5d, 0x60db64dc, 0x1e6e44d1, 0x36e89017, 0x485db01a, 0x3f77c841, 0x41c2e84c, 0x69443c8a, 0x17f11c87, 0x38499506, 0x46fcb50b, 0x6e7a61cd, 0x10cf41c0, 0x310b72cf, 0x4fbe52c2, 0x67388604, 0x198da609, 0x36352f88, 0x48800f85, 0x6006db43, 0x1eb3fb4e, 0x238ebd5d, 0x5d3b9d50, 0x75bd4996, 0x0b08699b, 0x24b0e01a, 0x5a05c017, 0x728314d1, 0x0c3634dc, 0x2df207d3, 0x534727de, 0x7bc1f318, 0x0574d315, 0x2acc5a94, 0x54797a99, 0x7cffae5f, 0x024a8e52, 0x06852279, 0x78300274, 0x50b6d6b2, 0x2e03f6bf, 0x01bb7f3e, 0x7f0e5f33, 0x57888bf5, 0x293dabf8, 0x08f998f7, 0x764cb8fa, 
0x5eca6c3c, 0x207f4c31, 0x0fc7c5b0, 0x7172e5bd, 0x59f4317b, 0x27411176, 0x1a7c5765, 0x64c97768, 0x4c4fa3ae, 0x32fa83a3, 0x1d420a22, 0x63f72a2f, 0x4b71fee9, 0x35c4dee4, 0x1400edeb, 0x6ab5cde6, 0x42331920, 0x3c86392d, 0x133eb0ac, 0x6d8b90a1, 0x450d4467, 0x3bb8646a }; struct index_entry { const unsigned char *ptr; const struct source_info *src; unsigned int val; }; struct index_entry_linked_list { struct index_entry *p_entry; struct index_entry_linked_list *next; }; struct unpacked_index_entry { struct index_entry entry; struct unpacked_index_entry *next; }; struct delta_index { unsigned long memsize; /* Total bytes pointed to by this index */ const struct source_info *last_src; /* Information about the referenced source */ unsigned int hash_mask; /* val & hash_mask gives the hash index for a given entry */ unsigned int num_entries; /* The total number of entries in this index */ struct index_entry *last_entry; /* Pointer to the last valid entry */ struct index_entry *hash[]; }; static unsigned int limit_hash_buckets(struct unpacked_index_entry **hash, unsigned int *hash_count, unsigned int hsize, unsigned int entries) { struct unpacked_index_entry *entry; unsigned int i; /* * Determine a limit on the number of entries in the same hash * bucket. This guards us against pathological data sets causing * really bad hash distribution with most entries in the same hash * bucket that would bring us to O(m*n) computing costs (m and n * corresponding to reference and target buffer sizes). * * Make sure none of the hash buckets has more entries than * we're willing to test. Otherwise we cull the entry list * uniformly to still preserve a good repartition across * the reference buffer. 
*/ for (i = 0; i < hsize; i++) { int acc; if (hash_count[i] <= HASH_LIMIT) continue; /* We leave exactly HASH_LIMIT entries in the bucket */ entries -= hash_count[i] - HASH_LIMIT; entry = hash[i]; acc = 0; /* * Assume that this loop is gone through exactly * HASH_LIMIT times and is entered and left with * acc==0. So the first statement in the loop * contributes (hash_count[i]-HASH_LIMIT)*HASH_LIMIT * to the accumulator, and the inner loop consequently * is run (hash_count[i]-HASH_LIMIT) times, removing * one element from the list each time. Since acc * balances out to 0 at the final run, the inner loop * body can't be left with entry==NULL. So we indeed * encounter entry==NULL in the outer loop only. */ do { acc += hash_count[i] - HASH_LIMIT; if (acc > 0) { struct unpacked_index_entry *keep = entry; do { entry = entry->next; acc -= HASH_LIMIT; } while (acc > 0); keep->next = entry->next; } entry = entry->next; } while (entry); } return entries; } static struct delta_index * pack_delta_index(struct unpacked_index_entry **hash, unsigned int hsize, unsigned int num_entries, struct delta_index *old_index) { unsigned int i, j, hmask, memsize, fit_in_old, copied_count; struct unpacked_index_entry *entry; struct delta_index *index; struct index_entry *packed_entry, **packed_hash, *old_entry; struct index_entry null_entry = {0}; void *mem; hmask = hsize - 1; // if (old_index) { // fprintf(stderr, "Packing %d entries into %d for total of %d entries" // " %x => %x\n", // num_entries - old_index->num_entries, // old_index->num_entries, num_entries, // old_index->hash_mask, hmask); // } else { // fprintf(stderr, "Packing %d entries into a new index\n", // num_entries); // } /* First, see if we can squeeze the new items into the existing structure. 
*/ fit_in_old = 0; copied_count = 0; if (old_index && old_index->hash_mask == hmask) { fit_in_old = 1; for (i = 0; i < hsize; ++i) { packed_entry = NULL; for (entry = hash[i]; entry; entry = entry->next) { if (packed_entry == NULL) { /* Find the last open spot */ packed_entry = old_index->hash[i + 1]; --packed_entry; while (packed_entry >= old_index->hash[i] && packed_entry->ptr == NULL) { --packed_entry; } ++packed_entry; } if (packed_entry >= old_index->hash[i+1] || packed_entry->ptr != NULL) { /* There are no free spots here :( */ fit_in_old = 0; break; } /* We found an empty spot to put this entry * Copy it over, and remove it from the linked list, just in * case we end up running out of room later. */ *packed_entry++ = entry->entry; assert(entry == hash[i]); hash[i] = entry->next; copied_count += 1; old_index->num_entries++; } if (!fit_in_old) { break; } } } if (old_index) { if (fit_in_old) { // fprintf(stderr, "Fit all %d entries into old index\n", // copied_count); /* * No need to allocate a new buffer, but return old_index ptr so * callers can distinguish this from an OOM failure. */ return old_index; } else { // fprintf(stderr, "Fit only %d entries into old index," // " reallocating\n", copied_count); } } /* * Now create the packed index in array form * rather than linked lists. 
     * Leave a gap of EXTRA_NULLS empty entries after each group, so that
     * more entries can be inserted later without repacking.
     */
    memsize = sizeof(*index)
        + sizeof(*packed_hash) * (hsize+1)
        + sizeof(*packed_entry) * (num_entries + hsize * EXTRA_NULLS);
    mem = malloc(memsize);
    if (!mem) {
        return NULL;
    }

    index = mem;
    index->memsize = memsize;
    index->hash_mask = hmask;
    index->num_entries = num_entries;
    if (old_index) {
        if (hmask < old_index->hash_mask) {
            fprintf(stderr, "hash mask was shrunk %x => %x\n",
                            old_index->hash_mask, hmask);
        }
        assert(hmask >= old_index->hash_mask);
    }

    mem = index->hash;
    packed_hash = mem;
    mem = packed_hash + (hsize+1);
    packed_entry = mem;

    for (i = 0; i < hsize; i++) {
        /*
         * Coalesce all entries belonging to one linked list
         * into consecutive array entries.
         */
        packed_hash[i] = packed_entry;
        /* Old comes earlier as a source, so it always comes first in a given
         * hash bucket.
         */
        if (old_index) {
            /* Could we optimize this to use memcpy when hmask ==
             * old_index->hash_mask? Would it make any real difference?
             */
            j = i & old_index->hash_mask;
            for (old_entry = old_index->hash[j];
                 old_entry < old_index->hash[j + 1] && old_entry->ptr != NULL;
                 old_entry++) {
                if ((old_entry->val & hmask) == i) {
                    *packed_entry++ = *old_entry;
                }
            }
        }
        for (entry = hash[i]; entry; entry = entry->next) {
            *packed_entry++ = entry->entry;
        }
        /* TODO: At this point packed_entry - packed_hash[i] is the number of
         *       records that we have inserted into this hash bucket.
         *       We should *really* consider doing some limiting along the
         *       lines of limit_hash_buckets() to avoid pathological behavior.
         */
        /* Now add extra 'NULL' entries that we can use for future expansion.
*/ for (j = 0; j < EXTRA_NULLS; ++j ) { *packed_entry++ = null_entry; } } /* Sentinel value to indicate the length of the last hash bucket */ packed_hash[hsize] = packed_entry; if (packed_entry - (struct index_entry *)mem != num_entries + hsize*EXTRA_NULLS) { fprintf(stderr, "We expected %d entries, but created %d\n", num_entries + hsize*EXTRA_NULLS, (int)(packed_entry - (struct index_entry*)mem)); } assert(packed_entry - (struct index_entry *)mem == num_entries + hsize*EXTRA_NULLS); index->last_entry = (packed_entry - 1); return index; } delta_result create_delta_index(const struct source_info *src, struct delta_index *old, struct delta_index **fresh, int max_bytes_to_index) { unsigned int i, hsize, hmask, num_entries, prev_val, *hash_count; unsigned int total_num_entries, stride, max_entries; const unsigned char *data, *buffer; struct delta_index *index; struct unpacked_index_entry *entry, **hash; void *mem; unsigned long memsize; if (!src->buf || !src->size) return DELTA_SOURCE_EMPTY; buffer = src->buf; /* Determine index hash size. Note that indexing skips the first byte so we subtract 1 to get the edge cases right. */ stride = RABIN_WINDOW; num_entries = (src->size - 1) / RABIN_WINDOW; if (max_bytes_to_index > 0) { max_entries = (unsigned int) (max_bytes_to_index / RABIN_WINDOW); if (num_entries > max_entries) { /* Limit the max number of matching entries. This reduces the 'best' * possible match, but means we don't consume all of ram. 
*/ num_entries = max_entries; stride = (src->size - 1) / num_entries; } } if (old != NULL) total_num_entries = num_entries + old->num_entries; else total_num_entries = num_entries; hsize = total_num_entries / 4; for (i = 4; (1u << i) < hsize && i < 31; i++); hsize = 1 << i; hmask = hsize - 1; if (old && old->hash_mask > hmask) { hmask = old->hash_mask; hsize = hmask + 1; } /* allocate lookup index */ memsize = sizeof(*hash) * hsize + sizeof(*entry) * total_num_entries; mem = malloc(memsize); if (!mem) return DELTA_OUT_OF_MEMORY; hash = mem; mem = hash + hsize; entry = mem; memset(hash, 0, hsize * sizeof(*hash)); /* allocate an array to count hash num_entries */ hash_count = calloc(hsize, sizeof(*hash_count)); if (!hash_count) { free(hash); return DELTA_OUT_OF_MEMORY; } /* then populate the index for the new data */ prev_val = ~0; for (data = buffer + num_entries * stride - RABIN_WINDOW; data >= buffer; data -= stride) { unsigned int val = 0; for (i = 1; i <= RABIN_WINDOW; i++) val = ((val << 8) | data[i]) ^ T[val >> RABIN_SHIFT]; if (val == prev_val) { /* keep the lowest of consecutive identical blocks */ entry[-1].entry.ptr = data + RABIN_WINDOW; --num_entries; --total_num_entries; } else { prev_val = val; i = val & hmask; entry->entry.ptr = data + RABIN_WINDOW; entry->entry.val = val; entry->entry.src = src; entry->next = hash[i]; hash[i] = entry++; hash_count[i]++; } } /* TODO: It would be nice to limit_hash_buckets at a better time. */ total_num_entries = limit_hash_buckets(hash, hash_count, hsize, total_num_entries); free(hash_count); index = pack_delta_index(hash, hsize, total_num_entries, old); free(hash); /* pack_delta_index only returns NULL on malloc failure */ if (!index) { return DELTA_OUT_OF_MEMORY; } index->last_src = src; *fresh = index; return DELTA_OK; } /* Take some entries, and put them into a custom hash. 
 * @param entries    A list of entries, sorted by position in file
 * @param num_entries   Length of entries
 * @param hsize      The number of hash buckets; hsize - 1 is used as the
 *                   mask, so it is expected to be a power of two
 */
struct index_entry_linked_list **
_put_entries_into_hash(struct index_entry *entries, unsigned int num_entries,
                       unsigned int hsize)
{
    unsigned int hash_offset, hmask, memsize;
    struct index_entry *entry;
    struct index_entry_linked_list *out_entry, **hash;
    void *mem;

    hmask = hsize - 1;

    memsize = sizeof(*hash) * hsize +
          sizeof(*out_entry) * num_entries;
    mem = malloc(memsize);
    if (!mem)
        return NULL;
    hash = mem;
    mem = hash + hsize;
    out_entry = mem;

    /* Only the hsize bucket heads need clearing; every out_entry slot is
     * fully written below before it is read.
     */
    memset(hash, 0, sizeof(*hash) * hsize);

    /* We know that entries are in the order we want in the output, but they
     * aren't "grouped" by hash bucket yet.
     */
    for (entry = entries + num_entries - 1; entry >= entries; --entry) {
        hash_offset = entry->val & hmask;
        out_entry->p_entry = entry;
        out_entry->next = hash[hash_offset];
        /* TODO: Remove entries that have identical vals, or at least filter
         *       the map a little bit.
         * if (hash[i] != NULL) {
         * }
         */
        hash[hash_offset] = out_entry;
        ++out_entry;
    }
    return hash;
}


struct delta_index *
create_index_from_old_and_new_entries(const struct delta_index *old_index,
                                      struct index_entry *entries,
                                      unsigned int num_entries)
{
    unsigned int i, j, hsize, hmask, total_num_entries;
    struct delta_index *index;
    struct index_entry *entry, *packed_entry, **packed_hash;
    struct index_entry null_entry = {0};
    void *mem;
    unsigned long memsize;
    struct index_entry_linked_list *unpacked_entry, **mini_hash;

    /* Determine index hash size.  Note that indexing skips the
       first byte to allow for optimizing the Rabin's polynomial
       initialization in create_delta().
*/ total_num_entries = num_entries + old_index->num_entries; hsize = total_num_entries / 4; for (i = 4; (1u << i) < hsize && i < 31; i++); hsize = 1 << i; if (hsize < old_index->hash_mask) { /* For some reason, there was a code path that would actually *shrink* * the hash size. This screws with some later code, and in general, I * think it better to make the hash bigger, rather than smaller. So * we'll just force the size here. * Possibly done by create_delta_index running into a * limit_hash_buckets call, that ended up transitioning across a * power-of-2. The cause isn't 100% clear, though. */ hsize = old_index->hash_mask + 1; } hmask = hsize - 1; // fprintf(stderr, "resizing index to insert %d entries into array" // " with %d entries: %x => %x\n", // num_entries, old_index->num_entries, old_index->hash_mask, hmask); memsize = sizeof(*index) + sizeof(*packed_hash) * (hsize+1) + sizeof(*packed_entry) * (total_num_entries + hsize*EXTRA_NULLS); mem = malloc(memsize); if (!mem) { return NULL; } index = mem; index->memsize = memsize; index->hash_mask = hmask; index->num_entries = total_num_entries; index->last_src = old_index->last_src; mem = index->hash; packed_hash = mem; mem = packed_hash + (hsize+1); packed_entry = mem; mini_hash = _put_entries_into_hash(entries, num_entries, hsize); if (mini_hash == NULL) { free(index); return NULL; } for (i = 0; i < hsize; i++) { /* * Coalesce all entries belonging in one hash bucket * into consecutive array entries. * The entries in old_index all come before 'entries'. */ packed_hash[i] = packed_entry; /* Copy any of the old entries across */ /* Would we rather use memcpy? 
*/ if (hmask == old_index->hash_mask) { for (entry = old_index->hash[i]; entry < old_index->hash[i+1] && entry->ptr != NULL; ++entry) { assert((entry->val & hmask) == i); *packed_entry++ = *entry; } } else { /* If we resized the index from this action, all of the old values * will be found in the previous location, but they will end up * spread across the new locations. */ j = i & old_index->hash_mask; for (entry = old_index->hash[j]; entry < old_index->hash[j+1] && entry->ptr != NULL; ++entry) { assert((entry->val & old_index->hash_mask) == j); if ((entry->val & hmask) == i) { /* Any entries not picked up here will be picked up on the * next pass. */ *packed_entry++ = *entry; } } } /* Now see if we need to insert any of the new entries. * Note that loop ends up O(hsize*num_entries), so we expect that * num_entries is always small. * We also help a little bit by collapsing the entry range when the * endpoints are inserted. However, an alternative would be to build a * quick hash lookup for just the new entries. 
* Testing shows that this list can easily get up to about 100 * entries, the tradeoff is a malloc, 1 pass over the entries, copying * them into a sorted buffer, and a free() when done, */ for (unpacked_entry = mini_hash[i]; unpacked_entry; unpacked_entry = unpacked_entry->next) { assert((unpacked_entry->p_entry->val & hmask) == i); *packed_entry++ = *(unpacked_entry->p_entry); } /* Now insert some extra nulls */ for (j = 0; j < EXTRA_NULLS; ++j) { *packed_entry++ = null_entry; } } free(mini_hash); /* Sentinel value to indicate the length of the last hash bucket */ packed_hash[hsize] = packed_entry; if ((packed_entry - (struct index_entry *)mem) != (total_num_entries + hsize*EXTRA_NULLS)) { fprintf(stderr, "We expected %d entries, but created %d\n", total_num_entries + hsize*EXTRA_NULLS, (int)(packed_entry - (struct index_entry*)mem)); fflush(stderr); } assert((packed_entry - (struct index_entry *)mem) == (total_num_entries + hsize * EXTRA_NULLS)); index->last_entry = (packed_entry - 1); return index; } void get_text(char buff[128], const unsigned char *ptr) { unsigned int i; const unsigned char *start; unsigned char cmd; start = (ptr-RABIN_WINDOW-1); cmd = *(start); if (cmd < 0x80) {// This is likely to be an insert instruction if (cmd < RABIN_WINDOW) { cmd = RABIN_WINDOW; } } else { /* This was either a copy [should never be] or it * was a longer insert so the insert start happened at 16 more * bytes back. 
*/ cmd = RABIN_WINDOW + 1; } if (cmd > 60) { cmd = 60; /* Be friendly to 80char terms */ } /* Copy the 1 byte command, and 4 bytes after the insert */ cmd += 5; memcpy(buff, start, cmd); buff[cmd] = 0; for (i = 0; i < cmd; ++i) { if (buff[i] == '\n') { buff[i] = 'N'; } else if (buff[i] == '\t') { buff[i] = 'T'; } } } delta_result create_delta_index_from_delta(const struct source_info *src, struct delta_index *old_index, struct delta_index **fresh) { unsigned int i, num_entries, max_num_entries, prev_val, num_inserted; unsigned int hash_offset; const unsigned char *data, *buffer, *top; unsigned char cmd; struct delta_index *new_index; struct index_entry *entry, *entries; if (!old_index) return DELTA_INDEX_NEEDED; if (!src->buf || !src->size) return DELTA_SOURCE_EMPTY; buffer = src->buf; top = buffer + src->size; /* Determine index hash size. Note that indexing skips the first byte to allow for optimizing the Rabin's polynomial initialization in create_delta(). This computes the maximum number of entries that could be held. The actual number will be recomputed during processing. */ max_num_entries = (src->size - 1) / RABIN_WINDOW; if (!max_num_entries) { *fresh = old_index; return DELTA_OK; } /* allocate an array to hold whatever entries we find */ entries = malloc(sizeof(*entry) * max_num_entries); if (!entries) /* malloc failure */ return DELTA_OUT_OF_MEMORY; /* then populate the index for the new data */ prev_val = ~0; data = buffer; /* target size */ /* get_delta_hdr_size doesn't mutate the content, just moves the * start-of-data pointer, so it is safe to do the cast. 
*/ get_delta_hdr_size((unsigned char**)&data, top); entry = entries; /* start at the first slot */ num_entries = 0; /* calculate the real number of entries */ while (data < top) { cmd = *data++; if (cmd & 0x80) { /* Copy instruction, skip it */ if (cmd & 0x01) data++; if (cmd & 0x02) data++; if (cmd & 0x04) data++; if (cmd & 0x08) data++; if (cmd & 0x10) data++; if (cmd & 0x20) data++; if (cmd & 0x40) data++; } else if (cmd) { /* Insert instruction, we want to index these bytes */ if (data + cmd > top) { /* Invalid insert, not enough bytes in the delta */ break; } /* The create_delta code requires a match at least 4 characters * (including only the last char of the RABIN_WINDOW) before it * will consider it something worth copying rather than inserting. * So we don't want to index anything that we know won't ever be a * match. */ for (; cmd > RABIN_WINDOW + 3; cmd -= RABIN_WINDOW, data += RABIN_WINDOW) { unsigned int val = 0; for (i = 1; i <= RABIN_WINDOW; i++) val = ((val << 8) | data[i]) ^ T[val >> RABIN_SHIFT]; if (val != prev_val) { /* Only keep the first of consecutive data */ prev_val = val; num_entries++; entry->ptr = data + RABIN_WINDOW; entry->val = val; entry->src = src; entry++; if (num_entries > max_num_entries) { /* We ran out of entry room, something is really wrong */ break; } } } /* Move the data pointer by whatever remainder is left */ data += cmd; } else { /* * cmd == 0 is reserved for future encoding * extensions. In the mean time we must fail when * encountering them (might be data corruption). 
*/ break; } } if (data != top) { /* The source_info data passed was corrupted or otherwise invalid */ free(entries); return DELTA_SOURCE_BAD; } if (num_entries == 0) { /** Nothing to index **/ free(entries); *fresh = old_index; return DELTA_OK; } old_index->last_src = src; /* See if we can fill in these values into the holes in the array */ entry = entries; num_inserted = 0; for (; num_entries > 0; --num_entries, ++entry) { struct index_entry *next_bucket_entry, *cur_entry, *bucket_first_entry; hash_offset = (entry->val & old_index->hash_mask); /* The basic structure is a hash => packed_entries that fit in that * hash bucket. Things are structured such that the hash-pointers are * strictly ordered. So we start by pointing to the next pointer, and * walk back until we stop getting NULL targets, and then go back * forward. If there are no NULL targets, then we know because * entry->ptr will not be NULL. */ // The start of the next bucket, this may point past the end of the // entry table if hash_offset is the last bucket. next_bucket_entry = old_index->hash[hash_offset + 1]; // First entry in this bucket bucket_first_entry = old_index->hash[hash_offset]; cur_entry = next_bucket_entry - 1; while (cur_entry->ptr == NULL && cur_entry >= bucket_first_entry) { cur_entry--; } // cur_entry now either points at the first NULL, or it points to // next_bucket_entry if there were no blank spots. 
cur_entry++; if (cur_entry >= next_bucket_entry || cur_entry->ptr != NULL) { /* There is no room for this entry, we have to resize */ // char buff[128]; // get_text(buff, entry->ptr); // fprintf(stderr, "Failed to find an opening @%x for %8x:\n '%s'\n", // hash_offset, entry->val, buff); // for (old_entry = old_index->hash[hash_offset]; // old_entry < old_index->hash[hash_offset+1]; // ++old_entry) { // get_text(buff, old_entry->ptr); // fprintf(stderr, " [%2d] %8x %8x: '%s'\n", // (int)(old_entry - old_index->hash[hash_offset]), // old_entry->val, old_entry->ptr, buff); // } break; } num_inserted++; *cur_entry = *entry; /* For entries which we *do* manage to insert into old_index, we don't * want them double copied into the final output. */ old_index->num_entries++; } if (num_entries > 0) { /* We couldn't fit the new entries into the old index, so allocate a * new one, and fill it with stuff. */ // fprintf(stderr, "inserted %d before resize\n", num_inserted); new_index = create_index_from_old_and_new_entries(old_index, entry, num_entries); } else { new_index = old_index; // fprintf(stderr, "inserted %d without resizing\n", num_inserted); } free(entries); /* create_index_from_old_and_new_entries returns NULL on malloc failure */ if (!new_index) return DELTA_OUT_OF_MEMORY; *fresh = new_index; return DELTA_OK; } void free_delta_index(struct delta_index *index) { free(index); } unsigned long sizeof_delta_index(struct delta_index *index) { if (index) return index->memsize; else return 0; } /* * The maximum size for any opcode sequence, including the initial header * plus Rabin window plus biggest copy. 
*/ #define MAX_OP_SIZE (5 + 5 + 1 + RABIN_WINDOW + 7) delta_result create_delta(const struct delta_index *index, const void *trg_buf, unsigned long trg_size, unsigned long *delta_size, unsigned long max_size, void **delta_data) { unsigned int i, outpos, outsize, moff, val; int msize; const struct source_info *msource; int inscnt; const unsigned char *ref_data, *ref_top, *data, *top; unsigned char *out; if (!trg_buf || !trg_size) return DELTA_BUFFER_EMPTY; if (index == NULL) return DELTA_INDEX_NEEDED; outpos = 0; outsize = 8192; if (max_size && outsize >= max_size) outsize = max_size + MAX_OP_SIZE + 1; out = malloc(outsize); if (!out) return DELTA_OUT_OF_MEMORY; /* store target buffer size */ i = trg_size; while (i >= 0x80) { out[outpos++] = i | 0x80; i >>= 7; } out[outpos++] = i; data = trg_buf; top = (const unsigned char *) trg_buf + trg_size; /* Start the matching by filling out with a simple 'insert' instruction, of * the first RABIN_WINDOW bytes of the input. */ outpos++; /* leave a byte for the insert command */ val = 0; for (i = 0; i < RABIN_WINDOW && data < top; i++, data++) { out[outpos++] = *data; val = ((val << 8) | *data) ^ T[val >> RABIN_SHIFT]; } /* we are now setup with an insert of 'i' bytes and val contains the RABIN * hash for those bytes, and data points to the RABIN_WINDOW+1 byte of * input. */ inscnt = i; moff = 0; msize = 0; msource = NULL; while (data < top) { if (msize < 4096) { /* we don't have a 'worthy enough' match yet, so let's look for * one. */ struct index_entry *entry; /* Shift the window by one byte. */ val ^= U[data[-RABIN_WINDOW]]; val = ((val << 8) | *data) ^ T[val >> RABIN_SHIFT]; i = val & index->hash_mask; /* TODO: When using multiple indexes like this, the hash tables * mapping val => index_entry become less efficient. * You end up getting a lot more collisions in the hash, * which doesn't actually lead to a entry->val match. 
*/ for (entry = index->hash[i]; entry < index->hash[i+1] && entry->src != NULL; entry++) { const unsigned char *ref; const unsigned char *src; int ref_size; if (entry->val != val) continue; ref = entry->ptr; src = data; ref_data = entry->src->buf; ref_top = ref_data + entry->src->size; ref_size = ref_top - ref; /* ref_size is the longest possible match that we could make * here. If ref_size <= msize, then we know that we cannot * match more bytes with this location that we have already * matched. */ if (ref_size > (top - src)) ref_size = top - src; if (ref_size <= msize) break; /* See how many bytes actually match at this location. */ while (ref_size-- && *src++ == *ref) ref++; if (msize < (ref - entry->ptr)) { /* this is our best match so far */ msize = ref - entry->ptr; msource = entry->src; moff = entry->ptr - ref_data; if (msize >= 4096) /* good enough */ break; } } } if (msize < 4) { /* The best match right now is less than 4 bytes long. So just add * the current byte to the insert instruction. Increment the insert * counter, and copy the byte of data into the output buffer. */ if (!inscnt) outpos++; out[outpos++] = *data++; inscnt++; if (inscnt == 0x7f) { /* We have a max length insert instruction, finalize it in the * output. */ out[outpos - inscnt - 1] = inscnt; inscnt = 0; } msize = 0; } else { unsigned int left; unsigned char *op; if (inscnt) { ref_data = msource->buf; while (moff && ref_data[moff-1] == data[-1]) { /* we can match one byte back */ msize++; moff--; data--; outpos--; if (--inscnt) continue; outpos--; /* remove count slot */ inscnt--; /* make it -1 */ break; } out[outpos - inscnt - 1] = inscnt; inscnt = 0; } /* A copy op is currently limited to 64KB (pack v2) */ left = (msize < 0x10000) ? 
0 : (msize - 0x10000); msize -= left; op = out + outpos++; i = 0x80; /* moff is the offset in the local structure, for encoding, we need * to push it into the global offset */ assert(moff < msource->size); moff += msource->agg_offset; assert(moff + msize <= index->last_src->size + index->last_src->agg_offset); if (moff & 0x000000ff) out[outpos++] = moff >> 0, i |= 0x01; if (moff & 0x0000ff00) out[outpos++] = moff >> 8, i |= 0x02; if (moff & 0x00ff0000) out[outpos++] = moff >> 16, i |= 0x04; if (moff & 0xff000000) out[outpos++] = moff >> 24, i |= 0x08; /* Put it back into local coordinates, in case we have multiple * copies in a row. */ moff -= msource->agg_offset; if (msize & 0x00ff) out[outpos++] = msize >> 0, i |= 0x10; if (msize & 0xff00) out[outpos++] = msize >> 8, i |= 0x20; *op = i; data += msize; moff += msize; msize = left; if (msize < 4096) { int j; val = 0; for (j = -RABIN_WINDOW; j < 0; j++) val = ((val << 8) | data[j]) ^ T[val >> RABIN_SHIFT]; } } if (outpos >= outsize - MAX_OP_SIZE) { void *tmp = out; outsize = outsize * 3 / 2; if (max_size && outsize >= max_size) outsize = max_size + MAX_OP_SIZE + 1; if (max_size && outpos > max_size) break; out = realloc(out, outsize); if (!out) { free(tmp); return DELTA_OUT_OF_MEMORY; } } } if (inscnt) out[outpos - inscnt - 1] = inscnt; if (max_size && outpos > max_size) { free(out); return DELTA_SIZE_TOO_BIG; } *delta_size = outpos; *delta_data = out; return DELTA_OK; } int get_entry_summary(const struct delta_index *index, int pos, unsigned int *text_offset, unsigned int *hash_val) { int hsize; const struct index_entry *entry; const struct index_entry *start_of_entries; unsigned int offset; if (pos < 0 || text_offset == NULL || hash_val == NULL || index == NULL) { return 0; } hsize = index->hash_mask + 1; start_of_entries = (struct index_entry *)(((struct index_entry **)index->hash) + (hsize + 1)); entry = start_of_entries + pos; if (entry > index->last_entry) { return 0; } if (entry->ptr == NULL) { *text_offset = 
0; *hash_val = 0; } else { offset = entry->src->agg_offset; offset += (entry->ptr - ((unsigned char *)entry->src->buf)); *text_offset = offset; *hash_val = entry->val; } return 1; } int get_hash_offset(const struct delta_index *index, int pos, unsigned int *entry_offset) { int hsize; const struct index_entry *entry; const struct index_entry *start_of_entries; if (pos < 0 || index == NULL || entry_offset == NULL) { return 0; } hsize = index->hash_mask + 1; start_of_entries = (struct index_entry *)(((struct index_entry **)index->hash) + (hsize + 1)); if (pos >= hsize) { return 0; } entry = index->hash[pos]; if (entry == NULL) { *entry_offset = -1; } else { *entry_offset = (entry - start_of_entries); } return 1; } unsigned int rabin_hash(const unsigned char *data) { int i; unsigned int val = 0; for (i = 0; i < RABIN_WINDOW; i++) val = ((val << 8) | data[i]) ^ T[val >> RABIN_SHIFT]; return val; } /* vim: et ts=4 sw=4 sts=4 */ bzrformats_3.4.0.orig/bzrformats/diff.py0000644000000000000000000000210415162115103015321 0ustar00# Copyright (C) 2004-2011 Canonical Ltd. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Diff utilities for bzrformats.""" import difflib class _PrematchedMatcher(difflib.SequenceMatcher): """Allow SequenceMatcher operations to use predetermined blocks.""" def __init__(self, matching_blocks): difflib.SequenceMatcher.__init__(self, None, None) self.matching_blocks = matching_blocks self.opcodes = None bzrformats_3.4.0.orig/bzrformats/dirstate.py0000644000000000000000000063277015162115107016256 0ustar00# Copyright (C) 2006-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA r"""DirState objects record the state of a directory and its bzr metadata. Pseudo EBNF grammar for the state file. Fields are separated by NULLs, and lines by NL. The field delimiters are omitted in the grammar, line delimiters are not - this is done for clarity of reading. All string data is in utf8.
:: MINIKIND = "f" | "d" | "l" | "a" | "r" | "t"; NL = "\n"; NULL = "\0"; WHOLE_NUMBER = {digit}, digit; BOOLEAN = "y" | "n"; REVISION_ID = a non-empty utf8 string; dirstate format = header line, full checksum, row count, parent_details, ghost_details, entries; header line = "#bazaar dirstate flat format 3", NL; full checksum = "crc32: ", ["-"], WHOLE_NUMBER, NL; row count = "num_entries: ", WHOLE_NUMBER, NL; parent_details = WHOLE_NUMBER, {REVISION_ID}*, NL; ghost_details = WHOLE_NUMBER, {REVISION_ID}*, NL; entries = {entry}; entry = entry_key, current_entry_details, {parent_entry_details}; entry_key = dirname, basename, fileid; current_entry_details = common_entry_details, working_entry_details; parent_entry_details = common_entry_details, history_entry_details; common_entry_details = MINIKIND, fingerprint, size, executable; working_entry_details = packed_stat; history_entry_details = REVISION_ID; executable = BOOLEAN; size = WHOLE_NUMBER; fingerprint = a nonempty utf8 sequence with meaning defined by minikind. Given this definition, the following is useful to know:: entry (aka row) - all the data for a given key. entry[0]: The key (dirname, basename, fileid) entry[0][0]: dirname entry[0][1]: basename entry[0][2]: fileid entry[1]: The tree(s) data for this path and id combination. entry[1][0]: The current tree entry[1][1]: The second tree For an entry for a tree, we have (using tree 0 - current tree) to demonstrate:: entry[1][0][0]: minikind entry[1][0][1]: fingerprint entry[1][0][2]: size entry[1][0][3]: executable entry[1][0][4]: packed_stat OR (for non tree-0):: entry[1][1][4]: revision_id There may be multiple rows at the root, one per id present in the root, so the in memory root row is now:: self._dirblocks[0] -> ('', [entry ...]), and the entries in there are:: entries[0][0]: b'' entries[0][1]: b'' entries[0][2]: file_id entries[1][0]: The tree data for the current tree for this fileid at / etc.
Kinds:: b'r' is a relocated entry: This path is not present in this tree with this id, but the id can be found at another location. The fingerprint is used to point to the target location. b'a' is an absent entry: In that tree the id is not present at this path. b'd' is a directory entry: This path in this tree is a directory with the current file id. There is no fingerprint for directories. b'f' is a file entry: As for directory, but it's a file. The fingerprint is the sha1 value of the file's canonical form, i.e. after any read filters have been applied to the convenience form stored in the working tree. b'l' is a symlink entry: As for directory, but a symlink. The fingerprint is the link target. b't' is a reference to a nested subtree; the fingerprint is the referenced revision. Ordering: The entries on disk and in memory are ordered according to the following keys:: directory, as a list of components filename file-id --- Format 1 had the following different definition: --- :: rows = dirname, NULL, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL, WHOLE NUMBER (* size *), NULL, packed stat, NULL, sha1|symlink target, {PARENT ROW} PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL, basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL, SHA1 PARENT ROW's are emitted for every parent that is not in the ghosts details line. That is, if the parents are foo, bar, baz, and the ghosts are bar, then each row will have a PARENT ROW for foo and baz, but not for bar. In any tree, a kind of 'moved' indicates that the fingerprint field (which we treat as opaque data specific to the 'kind' anyway) has the details for the id of this row in that tree. 
I'm strongly tempted to add an id->path index as well, but I think that where we need id->path mapping, we also usually read the whole file, so I'm going to skip that for the moment, as we have the ability to locate via bisect any path in any tree, and if we lookup things by path, we can accumulate an id->path mapping as we go, which will tend to match what we looked for. I plan to implement this asap, so please speak up now to alter/tweak the design - and once we stabilise on this, I'll update the wiki page for it. The rationale for all this is that we want fast operations for the common case (diff/status/commit/merge on all files) and extremely fast operations for the less common but still frequent case (status/diff/commit on specific files). Operations on specific files involve a scan for all the children of a path, *in every involved tree*, which the current format did not accommodate. ---- Design priorities: 1. Fast end to end use for bzr's top 5 use cases. (commit/diff/status/merge/???) 2. fall back to the current object model as needed. 3. scale usably to the largest trees known today - say 50K entries. (mozilla is an example of this) Locking: Eventually reuse dirstate objects across locks IFF the dirstate file has not been modified, but will require that we flush/ignore cached stat-hit data because we won't want to restat all files on disk just because a lock was acquired, yet we cannot trust the data after the previous lock was released. Memory representation:: vector of all directories, and vector of the children ? i.e. root_entries = (direntry for root, [parent_direntries_for_root]), dirblocks = [ ('', ['data for achild', 'data for bchild', 'data for cchild']) ('dir', ['achild', 'cchild', 'echild']) ] - single bisect to find N subtrees from a path spec - in-order for serialisation - this is 'dirblock' grouping.
- insertion of a file '/a' affects only the '/' child-vector, that is, to insert 10K elements from scratch does not generate O(N^2) memmoves of a single vector, but rather of each individual vector, which tends to be limited to a manageable number. Will scale badly on trees with 10K entries in a single directory. Compare with Inventory.InventoryDirectory which has a dictionary for the children. No bisect capability, can only probe for exact matches, or grab all elements and sort. - What's the risk of error here? Once we have the base format being processed we should have a net win regardless of optimality. So we are going to go with what seems reasonable. open questions: Maybe we should do a test profile of the core structure - 10K simulated searches/lookups/etc? Objects for each row? The lifetime of Dirstate objects is currently per lock, but see above for possible extensions. The lifetime of a row from a dirstate is expected to be very short in the optimistic case, which we are optimising for. For instance, subtree status will determine from analysis of the disk data what rows need to be examined at all, and will be able to determine from a single row whether that file has been altered or not, so we are aiming to process tens of thousands of entries each second within the dirstate context, before exposing anything to the larger codebase. This suggests we want the time for a single file comparison to be < 0.1 milliseconds. That would give us 10000 paths per second processed, and to scale to 100 thousand we'll need another order of magnitude to do that. Now, as the lifetime for all unchanged entries is the time to parse, stat the file on disk, and then immediately discard, the overhead of object creation becomes a significant cost. Figures: Creating a tuple from 3 elements was profiled at 0.0625 microseconds, whereas creating an object which is subclassed from tuple was 0.500 microseconds, and creating an object with 3 elements and slots was 3 microseconds long.
0.1 milliseconds is 100 microseconds, and ideally we'll get down to 10 microseconds for the total processing - having 33% of that be object creation is a huge overhead. There is a potential cost in using tuples within each row which is that the conditional code to do comparisons may be slower than method invocation, but method invocation is known to be slow due to stack frame creation, so avoiding methods in these tight inner loops is unfortunately desirable. We can consider a pyrex version of this with objects in future if desired. """ import bisect import codecs import contextlib import logging import operator import os import stat import sys import time from stat import S_IEXEC from . import inventory, lock, osutils from .errors import ( BadFileKindError, BzrFormatsError, InconsistentDelta, InconsistentDeltaDelta, InvalidNormalization, LockContention, LockNotHeld, NotVersionedError, ObjectNotLocked, ) from .osutils import _walkdirs_utf8, is_inside, is_inside_any, parent_directories logger = logging.getLogger("bzrformats.dirstate") evil_logger = logging.getLogger("bzrformats.evil") # This is the Windows equivalent of ENOTDIR # It is defined in pywin32.winerror, but we don't want a strong dependency for # just an error code. ERROR_PATH_NOT_FOUND = 3 ERROR_DIRECTORY = 267 class DirstateCorrupt(BzrFormatsError): """Exception raised when a dirstate file is corrupt.""" _fmt = "The dirstate file (%(state)s) appears to be corrupt: %(msg)s" def __init__(self, state, msg): """Create a DirstateCorrupt exception. Args: state: The dirstate that is corrupt. msg: Error message describing the corruption. """ super().__init__() self.state = state self.msg = msg class SHA1Provider: """An interface for getting sha1s of a file.""" def sha1(self, abspath): """Return the sha1 of a file given its absolute path. :param abspath: May be a filesystem encoded absolute path or a unicode path.
""" raise NotImplementedError(self.sha1) def stat_and_sha1(self, abspath): """Return the stat and sha1 of a file given its absolute path. :param abspath: May be a filesystem encoded absolute path or a unicode path. Note: the stat should be the stat of the physical file while the sha may be the sha of its canonical content. """ raise NotImplementedError(self.stat_and_sha1) class DirstateInventoryChange: """Change information from dirstate that can be converted to InventoryTreeChange.""" def __init__( self, file_id, path, changed_content, versioned, parent_id, name, kind, executable, copied=False, ): """Initialize a DirstateInventoryChange. Args: file_id: The file ID of the changed item. path: Tuple of (old_path, new_path). changed_content: Whether content changed. versioned: Tuple of (old_versioned, new_versioned). parent_id: Tuple of (old_parent_id, new_parent_id). name: Tuple of (old_name, new_name). kind: Tuple of (old_kind, new_kind). executable: Tuple of (old_executable, new_executable). copied: Whether this represents a copy (default False). 
""" self.file_id = file_id self.path = path self.changed_content = changed_content self.versioned = versioned self.parent_id = parent_id self.name = name self.kind = kind self.executable = executable self.copied = copied def meta_modified(self): """Return true if the meta data has been modified.""" if self.versioned == (True, True): return self.executable[0] != self.executable[1] return False def is_reparented(self): """Return whether the entry has been moved to a different parent.""" return self.parent_id[0] != self.parent_id[1] @property def renamed(self): """Return true if the entry has been renamed.""" return ( not self.copied and None not in self.name and None not in self.parent_id and (self.name[0] != self.name[1] or self.parent_id[0] != self.parent_id[1]) ) def discard_new(self): """Return a copy of this delta with the new side discarded.""" return self.__class__( self.file_id, (self.path[0], None), self.changed_content, (self.versioned[0], None), (self.parent_id[0], None), (self.name[0], None), (self.kind[0], None), (self.executable[0], None), copied=False, ) def _as_tuple(self): return ( self.file_id, self.path, self.changed_content, self.versioned, self.parent_id, self.name, self.kind, self.executable, self.copied, ) def __repr__(self): """Return string representation.""" return f"{self.__class__.__name__}{self._as_tuple()!r}" def __getitem__(self, index): """Return item at index.""" return self._as_tuple()[index] def __eq__(self, other): """Check equality.""" if hasattr(other, "_as_tuple"): return self._as_tuple() == other._as_tuple() if isinstance(other, tuple): return self._as_tuple() == other return NotImplemented class DirState: """Record directory and metadata state for fast access. A dirstate is a specialised data structure for managing local working tree state information. Its not yet well defined whether it is platform specific, and if it is how we detect/parameterize that. Dirstates use the usual lock_write, lock_read and unlock mechanisms. 
Unlike most bzr disk formats, DirStates must be locked for reading, using lock_read. (This is an os file lock internally.) This is necessary because the file can be rewritten in place. DirStates must be explicitly written with save() to commit changes; just unlocking them does not write the changes to disk. """ _kind_to_minikind = { "absent": b"a", "file": b"f", "directory": b"d", "relocated": b"r", "symlink": b"l", "tree-reference": b"t", } _minikind_to_kind = { b"a": "absent", b"f": "file", b"d": "directory", b"l": "symlink", b"r": "relocated", b"t": "tree-reference", } _stat_to_minikind = { stat.S_IFDIR: b"d", stat.S_IFREG: b"f", stat.S_IFLNK: b"l", } _to_yesno = {True: b"y", False: b"n"} # TODO profile the performance gain # of using int conversion rather than a dict here. AND BLAME ANDREW IF # it is faster. # TODO: jam 20070221 Figure out what to do if we have a record that exceeds # the BISECT_PAGE_SIZE. For now, we just have to make it large enough # that we are sure a single record will always fit. BISECT_PAGE_SIZE = 4096 NOT_IN_MEMORY = 0 IN_MEMORY_UNMODIFIED = 1 IN_MEMORY_MODIFIED = 2 IN_MEMORY_HASH_MODIFIED = 3 # Only hash-cache updates # A pack_stat (the x's) that is just noise and will never match the output # of base64 encode. NULLSTAT = b"x" * 32 NULL_PARENT_DETAILS = (b"a", b"", 0, False, b"") HEADER_FORMAT_2 = b"#bazaar dirstate flat format 2\n" HEADER_FORMAT_3 = b"#bazaar dirstate flat format 3\n" def __init__( self, path, sha1_provider, worth_saving_limit=0, use_filesystem_for_exec=True, fdatasync=False, ): """Create a DirState object. :param path: The path at which the dirstate file on disk should live. :param sha1_provider: an object meeting the SHA1Provider interface. :param worth_saving_limit: when the exact number of hash changed entries is known, only bother saving the dirstate if more than this count of entries have changed. -1 means never save hash changes, 0 means always save hash changes. 
:param use_filesystem_for_exec: Whether to trust the filesystem for executable bit information """ # _header_state and _dirblock_state represent the current state # of the dirstate metadata and the per-row data respectively. # NOT_IN_MEMORY indicates that no data is in memory # IN_MEMORY_UNMODIFIED indicates that what we have in memory # is the same as is on disk # IN_MEMORY_MODIFIED indicates that we have a modified version # of what is on disk. # In future we will add more granularity, for instance _dirblock_state # will probably support partially-in-memory as a separate variable, # allowing for partially-in-memory unmodified and partially-in-memory # modified states. self._header_state = DirState.NOT_IN_MEMORY self._dirblock_state = DirState.NOT_IN_MEMORY # If true, an error has been detected while updating the dirstate, and # for safety we're not going to commit to disk. self._changes_aborted = False self._dirblocks = [] self._ghosts = [] self._parents = [] self._state_file = None self._filename = path self._lock_token = None self._lock_state = None self._id_index = None # a map from packed_stat to sha's. self._packed_stat_index = None self._end_of_header = None self._cutoff_time = None self._split_path_cache = {} self._bisect_page_size = DirState.BISECT_PAGE_SIZE self._sha1_provider = sha1_provider self._sha1_file = self._sha1_provider.sha1 # These two attributes provide a simple cache for lookups into the # dirstate in-memory vectors. By probing respectively for the last # block, and for the next entry, we save nearly 2 bisections per path # during commit.
self._last_block_index = None self._last_entry_index = None # The set of known hash changes self._known_hash_changes = set() # How many hash changed entries can we have without saving self._worth_saving_limit = worth_saving_limit self._fdatasync = fdatasync self._use_filesystem_for_exec = use_filesystem_for_exec def __repr__(self): """Return string representation of the dirstate.""" return f"{self.__class__.__name__}({self._filename!r})" def _mark_modified(self, hash_changed_entries=None, header_modified=False): """Mark this dirstate as modified. :param hash_changed_entries: if non-None, mark just these entries as having their hash modified. :param header_modified: mark the header modified as well, not just the dirblocks. """ # logger.debug_callsite(3, "modified hash entries: %s", hash_changed_entries) if hash_changed_entries: self._known_hash_changes.update([e[0] for e in hash_changed_entries]) if self._dirblock_state in ( DirState.NOT_IN_MEMORY, DirState.IN_MEMORY_UNMODIFIED, ): # If the dirstate is already marked as IN_MEMORY_MODIFIED, then # that takes precedence. self._dirblock_state = DirState.IN_MEMORY_HASH_MODIFIED else: # TODO: Since we now have an IN_MEMORY_HASH_MODIFIED state, we # should fail noisily if someone tries to set # IN_MEMORY_MODIFIED but we don't have a write-lock! # We don't know exactly what changed so disable smart saving self._dirblock_state = DirState.IN_MEMORY_MODIFIED if header_modified: self._header_state = DirState.IN_MEMORY_MODIFIED def _mark_unmodified(self): """Mark this dirstate as unmodified.""" self._header_state = DirState.IN_MEMORY_UNMODIFIED self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED self._known_hash_changes = set() def add(self, path, file_id, kind, stat, fingerprint): """Add a path to be tracked. :param path: The path within the dirstate - b'' is the root, 'foo' is the path foo within the root, 'foo/bar' is the path bar within foo within the root. :param file_id: The file id of the path being added.
:param kind: The kind of the path, as a string like 'file', 'directory', etc. :param stat: The output of os.lstat for the path. :param fingerprint: The sha value of the file's canonical form (i.e. after any read filters have been applied), or the target of a symlink, or the referenced revision id for tree-references, or b'' for directories. """ # adding a file: # find the block it's in. # find the location in the block. # check it's not there # add it. # ------- copied from inventory.ensure_normalized_name - keep synced. # --- normalized_filename wants a unicode basename only, so get one. dirname, basename = osutils.split(path) # we don't import normalized_filename directly because we want to be # able to change the implementation at runtime for tests. norm_name, can_access = osutils.normalized_filename(basename) if norm_name != basename: if can_access: basename = norm_name else: raise InvalidNormalization(path) # you should never have files called . or ..; just add the directory # in the parent, or according to the special treatment for the root if basename == "." or basename == "..": raise inventory.InvalidEntryName(path) # now that we've normalised, we need the correct utf8 path and # dirname and basename elements. This single encode and split should be # faster than three separate encodes.
utf8path = (dirname + "/" + basename).strip("/").encode("utf8") dirname, basename = osutils.split(utf8path) # uses __class__ for speed; the check is needed for safety if file_id.__class__ is not bytes: raise AssertionError(f"must be a utf8 file_id not {type(file_id)}") # Make sure the file_id does not exist in this tree rename_from = None file_id_entry = self._get_entry(0, fileid_utf8=file_id, include_deleted=True) if file_id_entry != (None, None): if file_id_entry[1][0][0] == b"a": if file_id_entry[0] != (dirname, basename, file_id): # set the old name's current operation to rename self.update_minimal( file_id_entry[0], b"r", path_utf8=b"", packed_stat=b"", fingerprint=utf8path, ) rename_from = file_id_entry[0][0:2] else: path = osutils.pathjoin(file_id_entry[0][0], file_id_entry[0][1]) kind = DirState._minikind_to_kind[file_id_entry[1][0][0]] info = f"{kind}:{path}" raise inventory.DuplicateFileId(file_id, info) first_key = (dirname, basename, b"") block_index, present = self._find_block_index_from_key(first_key) if present: # check the path is not in the tree block = self._dirblocks[block_index][1] entry_index, _ = self._find_entry_index(first_key, block) while ( entry_index < len(block) and block[entry_index][0][0:2] == first_key[0:2] ): if block[entry_index][1][0][0] not in (b"a", b"r"): # this path is in the dirstate in the current tree. raise Exception("adding already added path!") entry_index += 1 else: # The block where we want to put the file is not present. But it # might be because the directory was empty, or not loaded yet. 
Look # for a parent entry, if not found, raise NotVersionedError parent_dir, parent_base = osutils.split(dirname) ( parent_block_idx, parent_entry_idx, _, parent_present, ) = self._get_block_entry_index(parent_dir, parent_base, 0) if not parent_present: raise NotVersionedError(path, str(self)) self._ensure_block(parent_block_idx, parent_entry_idx, dirname) block = self._dirblocks[block_index][1] entry_key = (dirname, basename, file_id) if stat is None: size = 0 packed_stat = DirState.NULLSTAT else: size = stat.st_size packed_stat = pack_stat(stat) parent_info = self._empty_parent_info() minikind = DirState._kind_to_minikind[kind] if rename_from is not None: old_path_utf8 = b"%s/%s" % rename_from if rename_from[0] else rename_from[1] parent_info[0] = (b"r", old_path_utf8, 0, False, b"") if kind == "file": entry_data = ( entry_key, [ (minikind, fingerprint, size, False, packed_stat), ] + parent_info, ) elif kind == "directory": entry_data = ( entry_key, [ (minikind, b"", 0, False, packed_stat), ] + parent_info, ) elif kind == "symlink": entry_data = ( entry_key, [ (minikind, fingerprint, size, False, packed_stat), ] + parent_info, ) elif kind == "tree-reference": entry_data = ( entry_key, [ (minikind, fingerprint, 0, False, packed_stat), ] + parent_info, ) else: raise BzrFormatsError(f"unknown kind {kind!r}") entry_index, present = self._find_entry_index(entry_key, block) if not present: block.insert(entry_index, entry_data) else: if block[entry_index][1][0][0] != b"a": raise AssertionError(f" {basename!r}({file_id!r}) already added") block[entry_index][1][0] = entry_data[1][0] if kind == "directory": # insert a new dirblock self._ensure_block(block_index, entry_index, utf8path) self._mark_modified() if self._id_index: self._id_index.add(entry_key) def _bisect(self, paths): """Bisect through the disk structure for specific rows. :param paths: A list of paths to find :return: A dict mapping path => entries for found entries. Missing entries will not be in the map. 
                 The list is not sorted, and entries will be populated
                 based on when they were read.
        """
        self._requires_lock()
        # We need the file pointer to be right after the initial header block
        self._read_header_if_needed()
        # If _dirblock_state was in memory, we should just return info from
        # there, this function is only meant to handle when we want to read
        # part of the disk.
        if self._dirblock_state != DirState.NOT_IN_MEMORY:
            raise AssertionError(f"bad dirblock state {self._dirblock_state!r}")

        # The disk representation is generally info + '\0\n\0' at the end. But
        # for bisecting, it is easier to treat this as '\0' + info + '\0\n'
        # Because it means we can sync on the '\n'
        state_file = self._state_file
        file_size = os.fstat(state_file.fileno()).st_size
        # We end up with 2 extra fields, we should have a trailing '\n' to
        # ensure that we read the whole record, and we should have a precursor
        # b'' which ensures that we start after the previous '\n'
        entry_field_count = _fields_per_entry(self._num_present_parents()) + 1

        low = self._end_of_header
        high = file_size - 1  # Ignore the final '\0'
        # Map from (dir, name) => entry
        found = {}

        # Avoid infinite seeking
        max_count = 30 * len(paths)
        count = 0
        # pending is a list of places to look.
        # each entry is a tuple of low, high, dir_names
        #   low -> the first byte offset to read (inclusive)
        #   high -> the last byte offset (inclusive)
        #   dir_names -> The list of (dir, name) pairs that should be found in
        #                the [low, high] range
        pending = [(low, high, paths)]

        page_size = self._bisect_page_size

        fields_to_entry = self._get_fields_to_entry()

        while pending:
            low, high, cur_files = pending.pop()

            if not cur_files or low >= high:
                # Nothing to find
                continue

            count += 1
            if count > max_count:
                raise BzrFormatsError("Too many seeks, most likely a bug.")

            mid = max(low, (low + high - page_size) // 2)

            state_file.seek(mid)
            # limit the read size, so we don't end up reading data that we have
            # already read.
read_size = min(page_size, (high - mid) + 1) block = state_file.read(read_size) start = mid entries = block.split(b"\n") if len(entries) < 2: # We didn't find a '\n', so we cannot have found any records. # So put this range back and try again. But we know we have to # increase the page size, because a single read did not contain # a record break (so records must be larger than page_size) page_size *= 2 pending.append((low, high, cur_files)) continue # Check the first and last entries, in case they are partial, or if # we don't care about the rest of this page first_entry_num = 0 first_fields = entries[0].split(b"\0") if len(first_fields) < entry_field_count: # We didn't get the complete first entry # so move start, and grab the next, which # should be a full entry start += len(entries[0]) + 1 first_fields = entries[1].split(b"\0") first_entry_num = 1 if len(first_fields) <= 2: # We didn't even get a filename here... what do we do? # Try a large page size and repeat this query page_size *= 2 pending.append((low, high, cur_files)) continue else: # Find what entries we are looking for, which occur before and # after this first record. 
after = start if first_fields[1]: first_path = first_fields[1] + b"/" + first_fields[2] else: first_path = first_fields[2] first_loc = bisect_path_left(cur_files, first_path) # These exist before the current location pre = cur_files[:first_loc] # These occur after the current location, which may be in the # data we read, or might be after the last entry post = cur_files[first_loc:] if post and len(first_fields) >= entry_field_count: # We have files after the first entry # Parse the last entry last_entry_num = len(entries) - 1 last_fields = entries[last_entry_num].split(b"\0") if len(last_fields) < entry_field_count: # The very last hunk was not complete, # read the previous hunk after = mid + len(block) - len(entries[-1]) last_entry_num -= 1 last_fields = entries[last_entry_num].split(b"\0") else: after = mid + len(block) if last_fields[1]: last_path = last_fields[1] + b"/" + last_fields[2] else: last_path = last_fields[2] last_loc = bisect_path_right(post, last_path) middle_files = post[:last_loc] post = post[last_loc:] if middle_files: # We have files that should occur in this block # (>= first, <= last) # Either we will find them here, or we can mark them as # missing. if middle_files[0] == first_path: # We might need to go before this location pre.append(first_path) if middle_files[-1] == last_path: post.insert(0, last_path) # Find out what paths we have paths = {first_path: [first_fields]} # last_path might == first_path so we need to be # careful if we should append rather than overwrite if last_entry_num != first_entry_num: paths.setdefault(last_path, []).append(last_fields) for num in range(first_entry_num + 1, last_entry_num): # TODO: jam 20070223 We are already splitting here, so # shouldn't we just split the whole thing rather # than doing the split again in add_one_record? 
fields = entries[num].split(b"\0") path = fields[1] + b"/" + fields[2] if fields[1] else fields[2] paths.setdefault(path, []).append(fields) for path in middle_files: for fields in paths.get(path, []): # offset by 1 because of the opening '\0' # consider changing fields_to_entry to avoid the # extra list slice entry = fields_to_entry(fields[1:]) found.setdefault(path, []).append(entry) # Now we have split up everything into pre, middle, and post, and # we have handled everything that fell in 'middle'. # We add 'post' first, so that we prefer to seek towards the # beginning, so that we will tend to go as early as we need, and # then only seek forward after that. if post: pending.append((after, high, post)) if pre: pending.append((low, start - 1, pre)) # Consider that we may want to return the directory entries in sorted # order. For now, we just return them in whatever order we found them, # and leave it up to the caller if they care if it is ordered or not. return found def _bisect_dirblocks(self, dir_list): """Bisect through the disk structure to find entries in given dirs. _bisect_dirblocks is meant to find the contents of directories, which differs from _bisect, which only finds individual entries. :param dir_list: A sorted list of directory names ['', 'dir', 'foo']. :return: A map from dir => entries_for_dir """ # TODO: jam 20070223 A lot of the bisecting logic could be shared # between this and _bisect. It would require parameterizing the # inner loop with a function, though. We should evaluate the # performance difference. self._requires_lock() # We need the file pointer to be right after the initial header block self._read_header_if_needed() # If _dirblock_state was in memory, we should just return info from # there, this function is only meant to handle when we want to read # part of the disk. 
        if self._dirblock_state != DirState.NOT_IN_MEMORY:
            raise AssertionError(f"bad dirblock state {self._dirblock_state!r}")
        # The disk representation is generally info + '\0\n\0' at the end. But
        # for bisecting, it is easier to treat this as '\0' + info + '\0\n'
        # Because it means we can sync on the '\n'
        state_file = self._state_file
        file_size = os.fstat(state_file.fileno()).st_size
        # We end up with 2 extra fields, we should have a trailing '\n' to
        # ensure that we read the whole record, and we should have a precursor
        # b'' which ensures that we start after the previous '\n'
        entry_field_count = _fields_per_entry(self._num_present_parents()) + 1

        low = self._end_of_header
        high = file_size - 1  # Ignore the final '\0'
        # Map from dir => entry
        found = {}

        # Avoid infinite seeking
        max_count = 30 * len(dir_list)
        count = 0
        # pending is a list of places to look.
        # each entry is a tuple of low, high, dirs
        #   low -> the first byte offset to read (inclusive)
        #   high -> the last byte offset (inclusive)
        #   dirs -> The list of directories that should be found in
        #           the [low, high] range
        pending = [(low, high, dir_list)]

        page_size = self._bisect_page_size

        fields_to_entry = self._get_fields_to_entry()

        while pending:
            low, high, cur_dirs = pending.pop()

            if not cur_dirs or low >= high:
                # Nothing to find
                continue

            count += 1
            if count > max_count:
                raise BzrFormatsError("Too many seeks, most likely a bug.")

            mid = max(low, (low + high - page_size) // 2)

            state_file.seek(mid)
            # limit the read size, so we don't end up reading data that we have
            # already read.
            read_size = min(page_size, (high - mid) + 1)
            block = state_file.read(read_size)

            start = mid
            entries = block.split(b"\n")

            if len(entries) < 2:
                # We didn't find a '\n', so we cannot have found any records.
                # So put this range back and try again.
But we know we have to # increase the page size, because a single read did not contain # a record break (so records must be larger than page_size) page_size *= 2 pending.append((low, high, cur_dirs)) continue # Check the first and last entries, in case they are partial, or if # we don't care about the rest of this page first_entry_num = 0 first_fields = entries[0].split(b"\0") if len(first_fields) < entry_field_count: # We didn't get the complete first entry # so move start, and grab the next, which # should be a full entry start += len(entries[0]) + 1 first_fields = entries[1].split(b"\0") first_entry_num = 1 if len(first_fields) <= 1: # We didn't even get a dirname here... what do we do? # Try a large page size and repeat this query page_size *= 2 pending.append((low, high, cur_dirs)) continue else: # Find what entries we are looking for, which occur before and # after this first record. after = start first_dir = first_fields[1] first_loc = bisect.bisect_left(cur_dirs, first_dir) # These exist before the current location pre = cur_dirs[:first_loc] # These occur after the current location, which may be in the # data we read, or might be after the last entry post = cur_dirs[first_loc:] if post and len(first_fields) >= entry_field_count: # We have records to look at after the first entry # Parse the last entry last_entry_num = len(entries) - 1 last_fields = entries[last_entry_num].split(b"\0") if len(last_fields) < entry_field_count: # The very last hunk was not complete, # read the previous hunk after = mid + len(block) - len(entries[-1]) last_entry_num -= 1 last_fields = entries[last_entry_num].split(b"\0") else: after = mid + len(block) last_dir = last_fields[1] last_loc = bisect.bisect_right(post, last_dir) middle_files = post[:last_loc] post = post[last_loc:] if middle_files: # We have files that should occur in this block # (>= first, <= last) # Either we will find them here, or we can mark them as # missing. 
if middle_files[0] == first_dir: # We might need to go before this location pre.append(first_dir) if middle_files[-1] == last_dir: post.insert(0, last_dir) # Find out what paths we have paths = {first_dir: [first_fields]} # last_dir might == first_dir so we need to be # careful if we should append rather than overwrite if last_entry_num != first_entry_num: paths.setdefault(last_dir, []).append(last_fields) for num in range(first_entry_num + 1, last_entry_num): # TODO: jam 20070223 We are already splitting here, so # shouldn't we just split the whole thing rather # than doing the split again in add_one_record? fields = entries[num].split(b"\0") paths.setdefault(fields[1], []).append(fields) for cur_dir in middle_files: for fields in paths.get(cur_dir, []): # offset by 1 because of the opening '\0' # consider changing fields_to_entry to avoid the # extra list slice entry = fields_to_entry(fields[1:]) found.setdefault(cur_dir, []).append(entry) # Now we have split up everything into pre, middle, and post, and # we have handled everything that fell in 'middle'. # We add 'post' first, so that we prefer to seek towards the # beginning, so that we will tend to go as early as we need, and # then only seek forward after that. if post: pending.append((after, high, post)) if pre: pending.append((low, start - 1, pre)) return found def _bisect_recursive(self, paths): """Bisect for entries for all paths and their children. This will use bisect to find all records for the supplied paths. It will then continue to bisect for any records which are marked as directories. (and renames?) :param paths: A sorted list of (dir, name) pairs eg: [('', b'a'), ('', b'f'), ('a/b', b'c')] :return: A dictionary mapping (dir, name, file_id) => [tree_info] """ # Map from (dir, name, file_id) => [tree_info] found = {} found_dir_names = set() # Directories that have been read processed_dirs = set() # Get the ball rolling with the first bisect for all entries. 
newly_found = self._bisect(paths) while newly_found: # Directories that need to be read pending_dirs = set() paths_to_search = set() for entry_list in newly_found.values(): for dir_name_id, trees_info in entry_list: found[dir_name_id] = trees_info found_dir_names.add(dir_name_id[:2]) is_dir = False for tree_info in trees_info: minikind = tree_info[0] if minikind == b"d": if is_dir: # We already processed this one as a directory, # we don't need to do the extra work again. continue subdir, name, _file_id = dir_name_id path = osutils.pathjoin(subdir, name) is_dir = True if path not in processed_dirs: pending_dirs.add(path) elif minikind == b"r": # Rename, we need to directly search the target # which is contained in the fingerprint column dir_name = osutils.split(tree_info[1]) if dir_name[0] in pending_dirs: # This entry will be found in the dir search continue if dir_name not in found_dir_names: paths_to_search.add(tree_info[1]) # Now we have a list of paths to look for directly, and # directory blocks that need to be read. # newly_found is mixing the keys between (dir, name) and path # entries, but that is okay, because we only really care about the # targets. newly_found = self._bisect(sorted(paths_to_search)) newly_found.update(self._bisect_dirblocks(sorted(pending_dirs))) processed_dirs.update(pending_dirs) return found def _discard_merge_parents(self): """Discard any parents trees beyond the first. Note that if this fails the dirstate is corrupted. After this function returns the dirstate contains 2 trees, neither of which are ghosted. """ self._read_header_if_needed() parents = self.get_parent_ids() if len(parents) < 1: return # only require all dirblocks if we are doing a full-pass removal. 
        self._read_dirblocks_if_needed()
        dead_patterns = {(b"a", b"r"), (b"a", b"a"), (b"r", b"r"), (b"r", b"a")}

        def iter_entries_removable():
            for block in self._dirblocks:
                deleted_positions = []
                for pos, entry in enumerate(block[1]):
                    yield entry
                    if (entry[1][0][0], entry[1][1][0]) in dead_patterns:
                        deleted_positions.append(pos)
                if deleted_positions:
                    if len(deleted_positions) == len(block[1]):
                        del block[1][:]
                    else:
                        for pos in reversed(deleted_positions):
                            del block[1][pos]

        # if the first parent is a ghost:
        if parents[0] in self.get_ghosts():
            empty_parent = [DirState.NULL_PARENT_DETAILS]
            for entry in iter_entries_removable():
                entry[1][1:] = empty_parent
        else:
            for entry in iter_entries_removable():
                del entry[1][2:]

        self._ghosts = []
        self._parents = [parents[0]]
        self._mark_modified(header_modified=True)

    def _empty_parent_info(self):
        return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) - len(self._ghosts))

    def _ensure_block(self, parent_block_index, parent_row_index, dirname):
        """Ensure a block for dirname exists.

        This function exists to let callers which know that there is a
        directory dirname ensure that the block for it exists. This block can
        fail to exist because of demand loading, or because a directory had no
        children. In either case it is not an error. It is however an error to
        call this if there is no parent entry for the directory, and thus the
        function requires the coordinates of such an entry to be provided.

        The root row is special cased and can be indicated with a parent block
        and row index of -1

        :param parent_block_index: The index of the block in which dirname's row
            exists.
        :param parent_row_index: The index in the parent block where the row
            exists.
        :param dirname: The utf8 dirname to ensure there is a block for.
        :return: The index for the block.
        """
        if dirname == b"" and parent_row_index == 0 and parent_block_index == 0:
            # This is the signature of the root row, and the
            # contents-of-root row is always index 1
            return 1
        # the basename of the directory must be the end of its full name.
        # The root special case requires both the parent block index and the
        # parent row index to be -1.
        if not (
            parent_row_index == -1 and parent_block_index == -1 and dirname == b""
        ) and not dirname.endswith(
            self._dirblocks[parent_block_index][1][parent_row_index][0][1]
        ):
            raise AssertionError(f"bad dirname {dirname!r}")
        block_index, present = self._find_block_index_from_key((dirname, b"", b""))
        if not present:
            # In future, when doing partial parsing, this should load and
            # populate the entire block.
            self._dirblocks.insert(block_index, (dirname, []))
        return block_index

    def _entries_to_current_state(self, new_entries):
        """Load new_entries into self.dirblocks.

        Process new_entries into the current state object, making them the active
        state.  The entries are grouped together by directory to form dirblocks.

        :param new_entries: A sorted list of entries. This function does not sort
            to prevent unneeded overhead when callers have a sorted list already.
        :return: Nothing.
        """
        if new_entries[0][0][0:2] != (b"", b""):
            raise AssertionError(f"Missing root row {new_entries[0][0]!r}")
        # The two blocks here are deliberate: the root block and the
        # contents-of-root block.
        self._dirblocks = [(b"", []), (b"", [])]
        current_block = self._dirblocks[0][1]
        current_dirname = b""
        append_entry = current_block.append
        for entry in new_entries:
            if entry[0][0] != current_dirname:
                # new block - different dirname
                current_block = []
                current_dirname = entry[0][0]
                self._dirblocks.append((current_dirname, current_block))
                append_entry = current_block.append
            # append the entry to the current block
            append_entry(entry)
        self._split_root_dirblock_into_contents()

    def _split_root_dirblock_into_contents(self):
        """Split the root dirblocks into root and contents-of-root.

        After parsing by path, we end up with root entries and contents-of-root
        entries in the same block.
This loop splits them out again. """ # The above loop leaves the "root block" entries mixed with the # "contents-of-root block". But we don't want an if check on # all entries, so instead we just fix it up here. if self._dirblocks[1] != (b"", []): raise ValueError(f"bad dirblock start {self._dirblocks[1]!r}") root_block = [] contents_of_root_block = [] for entry in self._dirblocks[0][1]: if not entry[0][1]: # This is a root entry root_block.append(entry) else: contents_of_root_block.append(entry) self._dirblocks[0] = (b"", root_block) self._dirblocks[1] = (b"", contents_of_root_block) def _entries_for_path(self, path): """Return a list with all the entries that match path for all ids.""" dirname, basename = os.path.split(path) key = (dirname, basename, b"") block_index, present = self._find_block_index_from_key(key) if not present: # the block which should contain path is absent. return [] result = [] block = self._dirblocks[block_index][1] entry_index, _ = self._find_entry_index(key, block) # we may need to look at multiple entries at this path: walk while the specific_files match. while entry_index < len(block) and block[entry_index][0][0:2] == key[0:2]: result.append(block[entry_index]) entry_index += 1 return result @staticmethod def _entry_to_line(entry): """Serialize entry to a NULL delimited line ready for _get_output_lines. :param entry: An entry_tuple as defined in the module docstring. """ entire_entry = list(entry[0]) for tree_number, tree_data in enumerate(entry[1]): # (minikind, fingerprint, size, executable, tree_specific_string) entire_entry.extend(tree_data) # 3 for the key, 5 for the fields per tree. 
            tree_offset = 3 + tree_number * 5
            # minikind
            entire_entry[tree_offset + 0] = tree_data[0]
            # size
            entire_entry[tree_offset + 2] = b"%d" % tree_data[2]
            # executable
            entire_entry[tree_offset + 3] = DirState._to_yesno[tree_data[3]]
        return b"\0".join(entire_entry)

    def _find_block(self, key, add_if_missing=False):
        """Return the block that key should be present in.

        :param key: A dirstate entry key.
        :param add_if_missing: If True, skip the check that the parent
            directory is versioned before inserting a missing block.
        :return: The block tuple.
        """
        block_index, present = self._find_block_index_from_key(key)
        if not present:
            if not add_if_missing:
                # check to see if key is versioned itself - we might want to
                # add it anyway, because dirs with no entries don't get a
                # dirblock at parse time.
                # This is an uncommon branch to take: most dirs have children,
                # and most code works with versioned paths.
                parent_base, parent_name = osutils.split(key[0])
                if not self._get_block_entry_index(parent_base, parent_name, 0)[3]:
                    # some parent path has not been added - it's an error to add
                    # this child
                    raise NotVersionedError(key[0:2], str(self))
            self._dirblocks.insert(block_index, (key[0], []))
        return self._dirblocks[block_index]

    def _find_block_index_from_key(self, key):
        """Find the dirblock index for a key.

        :return: The block index, True if the block for the key is present.
        """
        if key[0:2] == (b"", b""):
            return 0, True
        try:
            if (
                self._last_block_index is not None
                and self._dirblocks[self._last_block_index][0] == key[0]
            ):
                return self._last_block_index, True
        except IndexError:
            pass
        block_index = bisect_dirblock(
            self._dirblocks, key[0], 1, cache=self._split_path_cache
        )
        # _right returns one-past-where-key is so we have to subtract
        # one to use it.
we use _right here because there are two # b'' blocks - the root, and the contents of root # we always have a minimum of 2 in self._dirblocks: root and # root-contents, and for b'', we get 2 back, so this is # simple and correct: present = ( block_index < len(self._dirblocks) and self._dirblocks[block_index][0] == key[0] ) self._last_block_index = block_index # Reset the entry index cache to the beginning of the block. self._last_entry_index = -1 return block_index, present def _find_entry_index(self, key, block): """Find the entry index for a key in a block. :return: The entry index, True if the entry for the key is present. """ len_block = len(block) try: if self._last_entry_index is not None: # mini-bisect here. entry_index = self._last_entry_index + 1 # A hit is when the key is after the last slot, and before or # equal to the next slot. if ( entry_index > 0 and block[entry_index - 1][0] < key ) and key <= block[entry_index][0]: self._last_entry_index = entry_index present = block[entry_index][0] == key return entry_index, present except IndexError: pass entry_index = bisect.bisect_left(block, (key, [])) present = entry_index < len_block and block[entry_index][0] == key self._last_entry_index = entry_index return entry_index, present @staticmethod def from_tree(tree, dir_state_filename, sha1_provider=None): """Create a dirstate from a bzr Tree. :param tree: The tree which should provide parent information and inventory ids. :param sha1_provider: an object meeting the SHA1Provider interface. If None, a DefaultSHA1Provider is used. :return: a DirState object which is currently locked for writing. 
            (it was locked by DirState.initialize)
        """
        result = DirState.initialize(dir_state_filename, sha1_provider=sha1_provider)
        try:
            with contextlib.ExitStack() as exit_stack:
                exit_stack.enter_context(tree.lock_read())
                parent_ids = tree.get_parent_ids()
                parent_trees = []
                for parent_id in parent_ids:
                    parent_tree = tree.branch.repository.revision_tree(parent_id)
                    parent_trees.append((parent_id, parent_tree))
                    exit_stack.enter_context(parent_tree.lock_read())
                result.set_parent_trees(parent_trees, [])
                result.set_state_from_inventory(tree.root_inventory)
        except:
            # The caller won't have a chance to unlock this, so make sure we
            # clean up ourselves
            result.unlock()
            raise
        return result

    def update_by_delta(self, delta):
        """Apply an inventory delta to the dirstate for tree 0.

        This is the workhorse for apply_inventory_delta in dirstate based
        trees.

        :param delta: An inventory delta.  See Inventory.apply_delta for
            details.
        """
        self._read_dirblocks_if_needed()
        insertions = {}
        removals = {}
        # Accumulate parent references (path_utf8, id), to check for parentless
        # items or items placed under files/links/tree-references. We get
        # references from every item in the delta that is not a deletion and
        # is not itself the root.
        parents = set()
        # Added ids must not be in the dirstate already. This set holds those
        # ids.
        new_ids = set()
        # This loop transforms the delta to single atomic operations that can
        # be executed and validated.
delta.check() delta.sort() for old_path, new_path, file_id, inv_entry in delta: if not isinstance(file_id, bytes): raise AssertionError(f"must be a utf8 file_id not {type(file_id)}") if (file_id in insertions) or (file_id in removals): self._raise_invalid(old_path or new_path, file_id, "repeated file_id") if old_path is not None: old_path = old_path.encode("utf-8") removals[file_id] = old_path else: new_ids.add(file_id) if new_path is not None: if inv_entry is None: self._raise_invalid(new_path, file_id, "new_path with no entry") new_path = new_path.encode("utf-8") dirname_utf8, basename = osutils.split(new_path) if basename: parents.add((dirname_utf8, inv_entry.parent_id)) key = (dirname_utf8, basename, file_id) minikind = DirState._kind_to_minikind[inv_entry.kind] if minikind == b"t": fingerprint = inv_entry.reference_revision or b"" else: fingerprint = b"" insertions[file_id] = ( key, minikind, inv_entry.executable, fingerprint, new_path, ) # Transform moves into delete+add pairs if None not in (old_path, new_path): for child in self._iter_child_entries(0, old_path): if child[0][2] in insertions or child[0][2] in removals: continue child_dirname = child[0][0] child_basename = child[0][1] minikind = child[1][0][0] fingerprint = child[1][0][4] executable = child[1][0][3] old_child_path = osutils.pathjoin(child_dirname, child_basename) removals[child[0][2]] = old_child_path child_suffix = child_dirname[len(old_path) :] new_child_dirname = new_path + child_suffix key = (new_child_dirname, child_basename, child[0][2]) new_child_path = osutils.pathjoin(new_child_dirname, child_basename) insertions[child[0][2]] = ( key, minikind, executable, fingerprint, new_child_path, ) self._check_delta_ids_absent(new_ids, 0) try: self._apply_removals(removals.items()) self._apply_insertions(insertions.values()) # Validate parents self._after_delta_check_parents(parents, 0) except BzrFormatsError as e: self._changes_aborted = True if "integrity error" not in str(e): raise # 
_get_entry raises BzrError when a request is inconsistent; we # want such errors to be shown as InconsistentDelta - and that # fits the behaviour we trigger. raise InconsistentDeltaDelta(delta, f"error from _get_entry. {e}") from e def _apply_removals(self, removals): for file_id, path in sorted(removals, reverse=True, key=operator.itemgetter(1)): dirname, basename = osutils.split(path) block_i, entry_i, d_present, f_present = self._get_block_entry_index( dirname, basename, 0 ) try: entry = self._dirblocks[block_i][1][entry_i] except IndexError: self._raise_invalid(path, file_id, "Wrong path for old path.") if not f_present or entry[1][0][0] in (b"a", b"r"): self._raise_invalid(path, file_id, "Wrong path for old path.") if file_id != entry[0][2]: self._raise_invalid( path, file_id, "Attempt to remove path has wrong id - found {!r}.".format( entry[0][2] ), ) self._make_absent(entry) # See if we have a malformed delta: deleting a directory must not # leave crud behind. This increases the number of bisects needed # substantially, but deletion or renames of large numbers of paths # is rare enough it shouldn't be an issue (famous last words?) RBC # 20080730. block_i, entry_i, d_present, f_present = self._get_block_entry_index( path, b"", 0 ) if d_present: # The dir block is still present in the dirstate; this could # be due to it being in a parent tree, or a corrupt delta. for child_entry in self._dirblocks[block_i][1]: if child_entry[1][0][0] not in (b"r", b"a"): self._raise_invalid( path, entry[0][2], "The file id was deleted but its children were " "not deleted.", ) def _apply_insertions(self, adds): try: for key, minikind, executable, fingerprint, path_utf8 in sorted(adds): self.update_minimal( key, minikind, executable, fingerprint, path_utf8=path_utf8 ) except NotVersionedError: self._raise_invalid(path_utf8.decode("utf8"), key[2], "Missing parent") def update_basis_by_delta(self, delta, new_revid): """Update the parents of this tree after a commit. 
        This gives the tree one parent, with revision id new_revid. The
        inventory delta is applied to the current basis tree to generate the
        inventory for the parent new_revid, and all other parent trees are
        discarded.

        Note that an exception during the operation of this method will leave
        the dirstate in a corrupt state where it should not be saved.

        :param new_revid: The new revision id for the tree's parent.
        :param delta: An inventory delta (see apply_inventory_delta) describing
            the changes from the current left most parent revision to new_revid.
        """
        self._read_dirblocks_if_needed()
        self._discard_merge_parents()
        if self._ghosts != []:
            raise NotImplementedError(self.update_basis_by_delta)
        if len(self._parents) == 0:
            # set up a blank tree, the simplest way.
            empty_parent = DirState.NULL_PARENT_DETAILS
            for entry in self._iter_entries():
                entry[1].append(empty_parent)
            self._parents.append(new_revid)

        self._parents[0] = new_revid

        delta.check()
        delta.sort()
        adds = []
        changes = []
        deletes = []
        # The paths this function accepts are unicode and must be encoded as we
        # go.
        inv_to_entry = _inv_entry_to_details
        # delta is now (deletes, changes), (adds) in reverse lexicographical
        # order.
        # deletes in reverse lexicographic order are safe to process in situ.
        # renames are not, as a rename from any path could go to a path
        # lexicographically lower, so we transform renames into delete, add pairs,
        # expanding them recursively as needed.
        # At the same time, to reduce interface friction we convert the input
        # inventory entries to dirstate.
        root_only = ("", "")
        # Accumulate parent references (path_utf8, id), to check for parentless
        # items or items placed under files/links/tree-references. We get
        # references from every item in the delta that is not a deletion and
        # is not itself the root.
        parents = set()
        # Added ids must not be in the dirstate already. This set holds those
        # ids.
new_ids = set() for old_path, new_path, file_id, inv_entry in delta: if file_id.__class__ is not bytes: raise AssertionError(f"must be a utf8 file_id not {type(file_id)}") if inv_entry is not None and file_id != inv_entry.file_id: self._raise_invalid( new_path, file_id, f"mismatched entry file_id {inv_entry!r}" ) if new_path is None: new_path_utf8 = None else: if inv_entry is None: self._raise_invalid(new_path, file_id, "new_path with no entry") new_path_utf8 = new_path.encode("utf-8") # note the parent for validation dirname_utf8, basename_utf8 = osutils.split(new_path_utf8) if basename_utf8: parents.add((dirname_utf8, inv_entry.parent_id)) old_path_utf8 = None if old_path is None else old_path.encode("utf-8") if old_path is None: adds.append( (None, new_path_utf8, file_id, inv_to_entry(inv_entry), True) ) new_ids.add(file_id) elif new_path is None: deletes.append((old_path_utf8, None, file_id, None, True)) elif (old_path, new_path) == root_only: # change things in-place # Note: the case of a parent directory changing its file_id # tends to break optimizations here, because officially # the file has actually been moved, it just happens to # end up at the same path. If we can figure out how to # handle that case, we can avoid a lot of add+delete # pairs for objects that stay put. # elif old_path == new_path: changes.append( (old_path_utf8, new_path_utf8, file_id, inv_to_entry(inv_entry)) ) else: # Renames: # Because renames must preserve their children we must have # processed all relocations and removes before hand. The sort # order ensures we've examined the child paths, but we also # have to execute the removals, or the split to an add/delete # pair will result in the deleted item being reinserted, or # renamed items being reinserted twice - and possibly at the # wrong place. Splitting into a delete/add pair also simplifies # the handling of entries with (b'f', ...), (b'r' ...) 
because # the target of the b'r' is old_path here, and we add that to # deletes, meaning that the add handler does not need to check # for b'r' items on every pass. self._update_basis_apply_deletes(deletes) deletes = [] # Split into an add/delete pair recursively. adds.append( ( old_path_utf8, new_path_utf8, file_id, inv_to_entry(inv_entry), False, ) ) # Expunge deletes that we've seen so that deleted/renamed # children of a rename directory are handled correctly. new_deletes = reversed(list(self._iter_child_entries(1, old_path_utf8))) # Remove the current contents of the tree at orig_path, and # reinsert at the correct new path. for entry in new_deletes: child_dirname, child_basename, _child_file_id = entry[0] if child_dirname: source_path = child_dirname + b"/" + child_basename else: source_path = child_basename if new_path_utf8: target_path = new_path_utf8 + source_path[len(old_path_utf8) :] else: if old_path_utf8 == b"": raise AssertionError("cannot rename directory to itself") target_path = source_path[len(old_path_utf8) + 1 :] adds.append((None, target_path, entry[0][2], entry[1][1], False)) deletes.append((source_path, target_path, entry[0][2], None, False)) deletes.append((old_path_utf8, new_path_utf8, file_id, None, False)) self._check_delta_ids_absent(new_ids, 1) try: # Finish expunging deletes/first half of renames. self._update_basis_apply_deletes(deletes) # Reinstate second half of renames and new paths. self._update_basis_apply_adds(adds) # Apply in-situ changes. self._update_basis_apply_changes(changes) # Validate parents self._after_delta_check_parents(parents, 1) except BzrFormatsError as e: self._changes_aborted = True if "integrity error" not in str(e): raise # _get_entry raises BzrError when a request is inconsistent; we # want such errors to be shown as InconsistentDelta - and that # fits the behaviour we trigger. raise InconsistentDeltaDelta(delta, f"error from _get_entry. 
{e}") from e self._mark_modified(header_modified=True) self._id_index = None return def _check_delta_ids_absent(self, new_ids, tree_index): """Check that none of the file_ids in new_ids are present in a tree.""" if not new_ids: return id_index = self._get_id_index() for file_id in new_ids: for key in id_index.get(file_id): block_i, entry_i, _d_present, f_present = self._get_block_entry_index( key[0], key[1], tree_index ) if not f_present: # In a different tree continue entry = self._dirblocks[block_i][1][entry_i] if entry[0][2] != file_id: # Different file_id, so not what we want. continue self._raise_invalid( (b"%s/%s" % key[0:2]).decode("utf8"), file_id, "This file_id is new in the delta but already present in " "the target", ) def _raise_invalid(self, path, file_id, reason): self._changes_aborted = True raise InconsistentDelta(path, file_id, reason) def _update_basis_apply_adds(self, adds): """Apply a sequence of adds to tree 1 during update_basis_by_delta. They may be adds, or renames that have been split into add/delete pairs. :param adds: A sequence of adds. Each add is a tuple: (None, new_path_utf8, file_id, (entry_details), real_add). real_add is False when the add is the second half of a remove-and-reinsert pair created to handle renames and deletes. """ # Adds are accumulated partly from renames, so can be in any input # order - sort it. # TODO: we may want to sort in dirblocks order. That way each entry # will end up in the same directory, allowing the _get_entry # fast-path for looking up 2 items in the same dir work. adds.sort(key=lambda x: x[1]) # adds is now in lexographic order, which places all parents before # their children, so we can process it linearly. for old_path, new_path, file_id, new_details, real_add in adds: dirname, basename = osutils.split(new_path) entry_key = (dirname, basename, file_id) block_index, present = self._find_block_index_from_key(entry_key) if not present: # The block where we want to put the file is not present. 
# However, it might have just been an empty directory. Look for # the parent in the basis-so-far before throwing an error. parent_dir, parent_base = osutils.split(dirname) ( parent_block_idx, parent_entry_idx, _, parent_present, ) = self._get_block_entry_index(parent_dir, parent_base, 1) if not parent_present: self._raise_invalid( new_path, file_id, "Unable to find block for this record. Was the parent added?", ) self._ensure_block(parent_block_idx, parent_entry_idx, dirname) block = self._dirblocks[block_index][1] entry_index, present = self._find_entry_index(entry_key, block) if real_add and old_path is not None: self._raise_invalid( new_path, file_id, f"considered a real add but still had old_path at {old_path}", ) if present: entry = block[entry_index] basis_kind = entry[1][1][0] if basis_kind == b"a": entry[1][1] = new_details elif basis_kind == b"r": raise NotImplementedError() else: self._raise_invalid( new_path, file_id, "An entry was marked as a new add" " but the basis target already existed", ) else: # The exact key was not found in the block. However, we need to # check if there is a key next to us that would have matched. # We only need to check 2 locations, because there are only 2 # trees present. 
for maybe_index in range(entry_index - 1, entry_index + 1): if maybe_index < 0 or maybe_index >= len(block): continue maybe_entry = block[maybe_index] if maybe_entry[0][:2] != (dirname, basename): # Just a random neighbor continue if maybe_entry[0][2] == file_id: raise AssertionError( "_find_entry_index didnt find a key match" f" but walking the data did, for {entry_key}" ) basis_kind = maybe_entry[1][1][0] if basis_kind not in (b"a", b"r"): self._raise_invalid( new_path, file_id, "we have an add record for path, but the path" f" is already present with another file_id {maybe_entry[0][2]}", ) entry = (entry_key, [DirState.NULL_PARENT_DETAILS, new_details]) block.insert(entry_index, entry) active_kind = entry[1][0][0] if active_kind == b"a": # The active record shows up as absent, this could be genuine, # or it could be present at some other location. We need to # verify. id_index = self._get_id_index() # The id_index may not be perfectly accurate for tree1, because # we haven't been keeping it updated. However, it should be # fine for tree0, and that gives us enough info for what we # need keys = id_index.get(file_id) for key in keys: ( block_i, entry_i, _d_present, f_present, ) = self._get_block_entry_index(key[0], key[1], 0) if not f_present: continue active_entry = self._dirblocks[block_i][1][entry_i] if active_entry[0][2] != file_id: # Some other file is at this path, we don't need to # link it. continue real_active_kind = active_entry[1][0][0] if real_active_kind in (b"a", b"r"): # We found a record, which was not *this* record, # which matches the file_id, but is not actually # present. Something seems *really* wrong. self._raise_invalid( new_path, file_id, "We found a tree0 entry that doesnt make sense", ) # Now, we've found a tree0 entry which matches the file_id # but is at a different location. So update them to be # rename records. 
active_dir, active_name = active_entry[0][:2] if active_dir: active_path = active_dir + b"/" + active_name else: active_path = active_name active_entry[1][1] = (b"r", new_path, 0, False, b"") entry[1][0] = (b"r", active_path, 0, False, b"") elif active_kind == b"r": raise NotImplementedError() new_kind = new_details[0] if new_kind == b"d": self._ensure_block(block_index, entry_index, new_path) def _update_basis_apply_changes(self, changes): """Apply a sequence of changes to tree 1 during update_basis_by_delta. :param changes: A sequence of changes. Each change is a tuple: (path_utf8, path_utf8, file_id, (entry_details)) """ for _old_path, new_path, file_id, new_details in changes: # the entry for this file_id must be in tree 1. entry = self._get_entry(1, file_id, new_path) if entry[0] is None or entry[1][1][0] in (b"a", b"r"): self._raise_invalid( new_path, file_id, "changed entry considered not present" ) entry[1][1] = new_details def _update_basis_apply_deletes(self, deletes): """Apply a sequence of deletes to tree 1 during update_basis_by_delta. They may be deletes, or renames that have been split into add/delete pairs. :param deletes: A sequence of deletes. Each delete is a tuple: (old_path_utf8, new_path_utf8, file_id, None, real_delete). real_delete is True when the desired outcome is an actual deletion rather than the rename handling logic temporarily deleting a path during the replacement of a parent. """ null = DirState.NULL_PARENT_DETAILS for old_path, new_path, file_id, _, real_delete in deletes: if real_delete != (new_path is None): self._raise_invalid(old_path, file_id, "bad delete delta") # the entry for this file_id must be in tree 1.
dirname, basename = osutils.split(old_path) ( block_index, entry_index, _dir_present, file_present, ) = self._get_block_entry_index(dirname, basename, 1) if not file_present: self._raise_invalid( old_path, file_id, "basis tree does not contain removed entry" ) entry = self._dirblocks[block_index][1][entry_index] # The state of the entry in the 'active' WT active_kind = entry[1][0][0] if entry[0][2] != file_id: self._raise_invalid(old_path, file_id, "mismatched file_id in tree 1") dir_block = () old_kind = entry[1][1][0] if active_kind in b"ar": # The active tree doesn't have this file_id. # The basis tree is changing this record. If this is a # rename, then we don't want the record here at all # anymore. If it is just an in-place change, we want the # record here, but we'll add it if we need to. So we just # delete it if active_kind == b"r": active_path = entry[1][0][1] active_entry = self._get_entry(0, file_id, active_path) if active_entry[1][1][0] != b"r": self._raise_invalid( old_path, file_id, "Dirstate did not have matching rename entries", ) elif active_entry[1][0][0] in b"ar": self._raise_invalid( old_path, file_id, "Dirstate had a rename pointing at an inactive tree0", ) active_entry[1][1] = null del self._dirblocks[block_index][1][entry_index] if old_kind == b"d": # This was a directory, and the active tree says it # doesn't exist, and now the basis tree says it doesn't # exist. Remove its dirblock if present (dir_block_index, present) = self._find_block_index_from_key( (old_path, b"", b"") ) if present: dir_block = self._dirblocks[dir_block_index][1] if not dir_block: # This entry is empty, go ahead and just remove it del self._dirblocks[dir_block_index] else: # There is still an active record, so just mark this # removed. 
entry[1][1] = null block_i, _entry_i, d_present, _f_present = self._get_block_entry_index( old_path, b"", 1 ) if d_present: dir_block = self._dirblocks[block_i][1] for child_entry in dir_block: child_basis_kind = child_entry[1][1][0] if child_basis_kind not in b"ar": self._raise_invalid( old_path, file_id, "The file id was deleted but its children were not deleted.", ) def _after_delta_check_parents(self, parents, index): """Check that parents required by the delta are all intact. :param parents: An iterable of (path_utf8, file_id) tuples which are required to be present in tree 'index' at path_utf8 with id file_id and be a directory. :param index: The column in the dirstate to check for parents in. """ for dirname_utf8, file_id in parents: # Get the entry - the ensures that file_id, dirname_utf8 exists and # has the right file id. entry = self._get_entry(index, file_id, dirname_utf8) if entry[1] is None: self._raise_invalid( dirname_utf8.decode("utf8"), file_id, "This parent is not present." ) # Parents of things must be directories if entry[1][index][0] != b"d": self._raise_invalid( dirname_utf8.decode("utf8"), file_id, "This parent is not a directory.", ) def _observed_sha1( self, entry, sha1, stat_value, _stat_to_minikind=_stat_to_minikind ): """Note the sha1 of a file. :param entry: The entry the sha1 is for. :param sha1: The observed sha1. :param stat_value: The os.lstat for the file. """ try: minikind = _stat_to_minikind[stat_value.st_mode & 0o170000] except KeyError: # Unhandled kind return None if minikind == b"f": if self._cutoff_time is None: self._sha_cutoff_time() if ( stat_value.st_mtime < self._cutoff_time and stat_value.st_ctime < self._cutoff_time ): entry[1][0] = ( b"f", sha1, stat_value.st_size, entry[1][0][3], pack_stat(stat_value), ) self._mark_modified([entry]) def _sha_cutoff_time(self): """Return cutoff time. Files modified more recently than this time are at risk of being undetectably modified and so can't be cached. 
""" # Cache the cutoff time as long as we hold a lock. # time.time() isn't super expensive (approx 3.38us), but # when you call it 50,000 times it adds up. # For comparison, os.lstat() costs 7.2us if it is hot. self._cutoff_time = int(time.time()) - 3 return self._cutoff_time @staticmethod def _lstat(abspath, entry): """Return the os.lstat value for this path.""" return os.lstat(abspath) def _is_executable(self, mode, old_executable): """Is this file executable?""" if self._use_filesystem_for_exec: return bool(S_IEXEC & mode) else: return old_executable @staticmethod def _read_link(abspath, old_link): """Read the target of a symlink.""" # TODO: jam 200700301 On Win32, this could just return the value # already in memory. However, this really needs to be done at a # higher level, because there either won't be anything on disk, # or the thing on disk will be a file. if isinstance(abspath, str): # abspath is defined as the path to pass to lstat. readlink is # buggy in python < 2.6 (it doesn't encode unicode path into FS # encoding), so we need to encode ourselves knowing that unicode # paths are produced by UnicodeDirReader on purpose. abspath = os.fsencode(abspath) target = os.readlink(abspath) if sys.getfilesystemencoding() not in ("utf-8", "ascii"): # Change encoding if needed target = os.fsdecode(target).encode("UTF-8") return target def get_ghosts(self): """Return a list of the parent tree revision ids that are ghosts.""" self._read_header_if_needed() return self._ghosts def get_lines(self): """Serialise the entire dirstate to a sequence of lines.""" if ( self._header_state == DirState.IN_MEMORY_UNMODIFIED and self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED ): # read what's on disk. 
self._state_file.seek(0) return self._state_file.readlines() lines = [] lines.append(_get_parents_line(self.get_parent_ids())) lines.append(_get_ghosts_line(self._ghosts)) lines.extend(map(self._entry_to_line, self._iter_entries())) return _get_output_lines(lines) def _get_fields_to_entry(self): """Get a function which converts entry fields into a entry record. This handles size and executable, as well as parent records. :return: A function which takes a list of fields, and returns an appropriate record for storing in memory. """ # This is intentionally unrolled for performance num_present_parents = self._num_present_parents() if num_present_parents == 0: def fields_to_entry_0_parents(fields, _int=int): path_name_file_id_key = (fields[0], fields[1], fields[2]) return ( path_name_file_id_key, [ ( # Current tree fields[3], # minikind fields[4], # fingerprint _int(fields[5]), # size fields[6] == b"y", # executable fields[7], # packed_stat or revision_id ) ], ) return fields_to_entry_0_parents elif num_present_parents == 1: def fields_to_entry_1_parent(fields, _int=int): path_name_file_id_key = (fields[0], fields[1], fields[2]) return ( path_name_file_id_key, [ ( # Current tree fields[3], # minikind fields[4], # fingerprint _int(fields[5]), # size fields[6] == b"y", # executable fields[7], # packed_stat or revision_id ), ( # Parent 1 fields[8], # minikind fields[9], # fingerprint _int(fields[10]), # size fields[11] == b"y", # executable fields[12], # packed_stat or revision_id ), ], ) return fields_to_entry_1_parent elif num_present_parents == 2: def fields_to_entry_2_parents(fields, _int=int): path_name_file_id_key = (fields[0], fields[1], fields[2]) return ( path_name_file_id_key, [ ( # Current tree fields[3], # minikind fields[4], # fingerprint _int(fields[5]), # size fields[6] == b"y", # executable fields[7], # packed_stat or revision_id ), ( # Parent 1 fields[8], # minikind fields[9], # fingerprint _int(fields[10]), # size fields[11] == b"y", # executable 
fields[12], # packed_stat or revision_id ), ( # Parent 2 fields[13], # minikind fields[14], # fingerprint _int(fields[15]), # size fields[16] == b"y", # executable fields[17], # packed_stat or revision_id ), ], ) return fields_to_entry_2_parents else: def fields_to_entry_n_parents(fields, _int=int): path_name_file_id_key = (fields[0], fields[1], fields[2]) trees = [ ( fields[cur], # minikind fields[cur + 1], # fingerprint _int(fields[cur + 2]), # size fields[cur + 3] == b"y", # executable fields[cur + 4], # stat or revision_id ) for cur in range(3, len(fields) - 1, 5) ] return path_name_file_id_key, trees return fields_to_entry_n_parents def get_parent_ids(self): """Return a list of the parent tree ids for the directory state.""" self._read_header_if_needed() return list(self._parents) def _get_block_entry_index(self, dirname, basename, tree_index): """Get the coordinates for a path in the state structure. :param dirname: The utf8 dirname to lookup. :param basename: The utf8 basename to lookup. :param tree_index: The index of the tree for which this lookup should be attempted. :return: A tuple describing where the path is located, or should be inserted. The tuple contains four fields: the block index, the row index, the directory is present (boolean), the entire path is present (boolean). There is no guarantee that either coordinate is currently reachable unless the found field for it is True. For instance, a directory not present in the searched tree may be returned with a value one greater than the current highest block offset. The directory present field will always be True when the path present field is True. The directory present field does NOT indicate that the directory is present in the searched tree, rather it indicates that there are at least some files in some tree present there. 
""" self._read_dirblocks_if_needed() key = dirname, basename, b"" block_index, present = self._find_block_index_from_key(key) if not present: # no such directory - return the dir index and 0 for the row. return block_index, 0, False, False block = self._dirblocks[block_index][1] # access the entries only entry_index, present = self._find_entry_index(key, block) # linear search through entries at this path to find the one # requested. while entry_index < len(block) and block[entry_index][0][1] == basename: if block[entry_index][1][tree_index][0] not in (b"a", b"r"): # neither absent or relocated return block_index, entry_index, True, True entry_index += 1 return block_index, entry_index, True, False def _get_entry( self, tree_index, fileid_utf8=None, path_utf8=None, include_deleted=False ): """Get the dirstate entry for path in tree tree_index. If either file_id or path is supplied, it is used as the key to lookup. If both are supplied, the fastest lookup is used, and an error is raised if they do not both point at the same row. :param tree_index: The index of the tree we wish to locate this path in. If the path is present in that tree, the entry containing its details is returned, otherwise (None, None) is returned 0 is the working tree, higher indexes are successive parent trees. :param fileid_utf8: A utf8 file_id to look up. :param path_utf8: An utf8 path to be looked up. :param include_deleted: If True, and performing a lookup via fileid_utf8 rather than path_utf8, return an entry for deleted (absent) paths. 
:return: The dirstate entry tuple for path, or (None, None) """ self._read_dirblocks_if_needed() if path_utf8 is not None: if not isinstance(path_utf8, bytes): raise BzrFormatsError( f"path_utf8 is not bytes: {type(path_utf8)} {path_utf8!r}" ) # path lookups are faster dirname, basename = osutils.split(path_utf8) ( block_index, entry_index, _dir_present, file_present, ) = self._get_block_entry_index(dirname, basename, tree_index) if not file_present: return None, None entry = self._dirblocks[block_index][1][entry_index] if not (entry[0][2] and entry[1][tree_index][0] not in (b"a", b"r")): raise AssertionError("unversioned entry?") if fileid_utf8 and entry[0][2] != fileid_utf8: self._changes_aborted = True raise BzrFormatsError( "integrity error ? : mismatching tree_index, file_id and path" ) return entry else: possible_keys = self._get_id_index().get(fileid_utf8) if not possible_keys: return None, None for key in possible_keys: block_index, present = self._find_block_index_from_key(key) # strange, probably indicates an out of date # id index - for now, allow this. if not present: continue # WARNING: DO not change this code to use _get_block_entry_index # as that function is not suitable: it does not use the key # to lookup, and thus the wrong coordinates are returned. block = self._dirblocks[block_index][1] entry_index, present = self._find_entry_index(key, block) if present: entry = self._dirblocks[block_index][1][entry_index] # TODO: We might want to assert that entry[0][2] == # fileid_utf8. # GZ 2017-06-09: Hoist set of minkinds somewhere if entry[1][tree_index][0] in {b"f", b"d", b"l", b"t"}: # this is the result we are looking for: the # real home of this file_id in this tree. 
return entry if entry[1][tree_index][0] == b"a": # there is no home for this entry in this tree if include_deleted: return entry return None, None if entry[1][tree_index][0] != b"r": raise AssertionError( "entry {!r} has invalid minikind {!r} for tree {!r}".format( entry, entry[1][tree_index][0], tree_index ) ) real_path = entry[1][tree_index][1] return self._get_entry( tree_index, fileid_utf8=fileid_utf8, path_utf8=real_path ) return None, None @classmethod def initialize(cls, path, sha1_provider=None): """Create a new dirstate on path. The new dirstate will be an empty tree - that is it has no parents, and only a root node - which has id ROOT_ID. :param path: The name of the file for the dirstate. :param sha1_provider: an object meeting the SHA1Provider interface. If None, a DefaultSHA1Provider is used. :return: A write-locked DirState object. """ # This constructs a new DirState object on a path, sets the _state_file # to a new empty file for that path. It then calls _set_data() with our # stock empty dirstate information - a root with ROOT_ID, no children, # and no parents. Finally it calls save() to ensure that this data will # persist. if sha1_provider is None: sha1_provider = DefaultSHA1Provider() result = cls(path, sha1_provider) # root dir and root dir contents with no children. empty_tree_dirblocks = [(b"", []), (b"", [])] # a new root directory, with a NULLSTAT. empty_tree_dirblocks[0][1].append( ( (b"", b"", inventory.ROOT_ID), [ (b"d", b"", 0, False, DirState.NULLSTAT), ], ) ) result.lock_write() try: result._set_data([], empty_tree_dirblocks) result.save() except: result.unlock() raise return result def _iter_child_entries(self, tree_index, path_utf8): """Iterate over all the entries that are children of path_utf. This only returns entries that are present (not in b'a', b'r') in tree_index. tree_index data is not refreshed, so if tree 0 is used, results may differ from that obtained if paths were statted to determine what ones were directories. 
Asking for the children of a non-directory will return an empty iterator. """ pending_dirs = [] next_pending_dirs = [path_utf8] absent = (b"a", b"r") while next_pending_dirs: pending_dirs = next_pending_dirs next_pending_dirs = [] for path in pending_dirs: block_index, present = self._find_block_index_from_key((path, b"", b"")) if block_index == 0: block_index = 1 if len(self._dirblocks) == 1: # asked for the children of the root with no other # contents. return if not present: # children of a non-directory asked for. continue block = self._dirblocks[block_index] for entry in block[1]: kind = entry[1][tree_index][0] if kind not in absent: yield entry if kind == b"d": if entry[0][0]: path = entry[0][0] + b"/" + entry[0][1] else: path = entry[0][1] next_pending_dirs.append(path) def _iter_entries(self): """Iterate over all the entries in the dirstate. Each yielded item is an entry in the standard format described in the docstring of bzrformats.dirstate. """ self._read_dirblocks_if_needed() for directory in self._dirblocks: yield from directory[1] def _get_id_index(self): """Get an id index of self._dirblocks. This maps from file_id => [(directory, name, file_id)] entries where that file_id appears in one of the trees. """ if self._id_index is None: id_index = IdIndex() for key, _tree_details in self._iter_entries(): id_index.add(key) self._id_index = id_index return self._id_index @classmethod def _make_deleted_row(cls, fileid_utf8, parents): """Return a deleted row for fileid_utf8.""" return ( b"/", b"RECYCLED.BIN", b"file", fileid_utf8, 0, DirState.NULLSTAT, b"", ), parents def _num_present_parents(self): """The number of parent entries in each record row.""" return len(self._parents) - len(self._ghosts) @classmethod def on_file( cls, path, sha1_provider=None, worth_saving_limit=0, use_filesystem_for_exec=True, fdatasync=False, ): """Construct a DirState on the file at path "path". :param path: The path at which the dirstate file on disk should live.
:param sha1_provider: an object meeting the SHA1Provider interface. If None, a DefaultSHA1Provider is used. :param worth_saving_limit: when the exact number of hash changed entries is known, only bother saving the dirstate if more than this count of entries have changed. -1 means never save. :param use_filesystem_for_exec: Whether to trust the filesystem for executable bit information :return: An unlocked DirState object, associated with the given path. """ if sha1_provider is None: sha1_provider = DefaultSHA1Provider() result = cls( path, sha1_provider, worth_saving_limit=worth_saving_limit, use_filesystem_for_exec=use_filesystem_for_exec, fdatasync=fdatasync, ) return result def _read_dirblocks_if_needed(self): """Read in all the dirblocks from the file if they are not in memory. This populates self._dirblocks, and sets self._dirblock_state to IN_MEMORY_UNMODIFIED. It is not currently ready for incremental block loading. """ self._read_header_if_needed() if self._dirblock_state == DirState.NOT_IN_MEMORY: _read_dirblocks(self) def _read_header(self): """This reads in the metadata header, and the parent ids. After reading in, the file should be positioned at the null just before the start of the first record in the file. :return: (expected crc checksum, number of entries, parent list) """ self._read_prelude() parent_line = self._state_file.readline() info = parent_line.split(b"\0") int(info[0]) self._parents = info[1:-1] ghost_line = self._state_file.readline() info = ghost_line.split(b"\0") int(info[1]) self._ghosts = info[2:-1] self._header_state = DirState.IN_MEMORY_UNMODIFIED self._end_of_header = self._state_file.tell() def _read_header_if_needed(self): """Read the header of the dirstate file if needed.""" # inline this as it will be called a lot if not self._lock_token: raise ObjectNotLocked(self) if self._header_state == DirState.NOT_IN_MEMORY: self._read_header() def _read_prelude(self): """Read in the prelude header of the dirstate file. 
This only reads in the stuff that is not connected to the crc checksum. The position will be correct to read in the rest of the file and check the checksum after this point. The next entry in the file should be the number of parents and their ids, followed by a newline. """ header = self._state_file.readline() if header != DirState.HEADER_FORMAT_3: raise BzrFormatsError(f"invalid header line: {header!r}") crc_line = self._state_file.readline() if not crc_line.startswith(b"crc32: "): raise BzrFormatsError(f"missing crc32 checksum: {crc_line!r}") self.crc_expected = int(crc_line[len(b"crc32: ") : -1]) num_entries_line = self._state_file.readline() if not num_entries_line.startswith(b"num_entries: "): raise BzrFormatsError("missing num_entries line") self._num_entries = int(num_entries_line[len(b"num_entries: ") : -1]) def sha1_from_stat(self, path, stat_result): """Find a sha1 given a stat lookup.""" return self._get_packed_stat_index().get(pack_stat(stat_result), None) def _get_packed_stat_index(self): """Get a packed_stat index of self._dirblocks.""" if self._packed_stat_index is None: index = {} for _key, tree_details in self._iter_entries(): if tree_details[0][0] == b"f": index[tree_details[0][4]] = tree_details[0][1] self._packed_stat_index = index return self._packed_stat_index def save(self): """Save any pending changes created during this session. We reuse the existing file, because that prevents race conditions with file creation, and use OS locks on it to prevent concurrent modification and reads - because dirstate's incremental data aggregation is not compatible with reading a modified file, and replacing a file in use by another process is impossible on Windows. A dirstate in read-only mode should be smart enough though to validate that the file has not changed, and otherwise discard its cache and start over, to allow for fine-grained read lock duration, so 'status' won't block 'commit' - for example.
""" if self._changes_aborted: # Should this be a warning? For now, I'm expecting that places that # mark it inconsistent will warn, making a warning here redundant. logger.debug("Not saving DirState because _changes_aborted is set.") return # TODO: Since we now distinguish IN_MEMORY_MODIFIED from # IN_MEMORY_HASH_MODIFIED, we should only fail quietly if we fail # to save an IN_MEMORY_HASH_MODIFIED, and fail *noisily* if we # fail to save IN_MEMORY_MODIFIED if not self._worth_saving(): return grabbed_write_lock = False if self._lock_state != "w": grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock() # Switch over to the new lock, as the old one may be closed. # TODO: jam 20070315 We should validate the disk file has # not changed contents, since temporary_write_lock may # not be an atomic operation. self._lock_token = new_lock self._state_file = new_lock.f if not grabbed_write_lock: # We couldn't grab a write lock, so we switch back to a read one return try: lines = self.get_lines() self._state_file.seek(0) self._state_file.writelines(lines) self._state_file.truncate() self._state_file.flush() self._maybe_fdatasync() self._mark_unmodified() finally: if grabbed_write_lock: self._lock_token = self._lock_token.restore_read_lock() self._state_file = self._lock_token.f # TODO: jam 20070315 We should validate the disk file has # not changed contents. Since restore_read_lock may # not be an atomic operation. 
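# Illustrative sketch only (not part of bzrformats): the in-place rewrite that
# save() performs on its already-open state file can be reduced to the
# seek / writelines / truncate / flush sequence shown below. The names
# rewrite_in_place and demo are hypothetical, locking is left out entirely, and
# os.fsync is used here as a stand-in for osutils.fdatasync, whose exact
# behaviour is assumed rather than taken from the source.

```python
import os
import tempfile


def rewrite_in_place(state_file, lines, sync=False):
    """Overwrite an open binary file with ``lines`` without recreating it."""
    # Go back to the start so the new content replaces the old content.
    state_file.seek(0)
    state_file.writelines(lines)
    # Drop any trailing bytes left over from a longer previous version.
    state_file.truncate()
    # Push Python's buffer to the OS; optionally ask the OS to hit the disk.
    state_file.flush()
    if sync:
        os.fsync(state_file.fileno())


def demo():
    """Write a long file, rewrite it with shorter content, return the result."""
    with tempfile.TemporaryFile() as f:
        f.write(b"old header\nold row 1\nold row 2\n")
        rewrite_in_place(f, [b"new header\n", b"row\n"])
        f.seek(0)
        return f.read()
```

# Without the truncate() call the tail of the older, longer content would
# survive after the new lines, which is why the sequence above truncates
# before flushing.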
def _maybe_fdatasync(self): """Flush to disk if possible and if not configured off.""" if self._fdatasync: osutils.fdatasync(self._state_file.fileno()) def _worth_saving(self): """Is it worth saving the dirstate or not?""" if ( self._header_state == DirState.IN_MEMORY_MODIFIED or self._dirblock_state == DirState.IN_MEMORY_MODIFIED ): return True if self._dirblock_state == DirState.IN_MEMORY_HASH_MODIFIED: if self._worth_saving_limit == -1: # We never save hash changes when the limit is -1 return False # If we're using smart saving and only a small number of # entries have changed their hash, don't bother saving. John has # suggested using a heuristic here based on the size of the # changed files and/or tree. For now, we go with a configurable # number of changes, keeping the calculation time # as low overhead as possible. (This also keeps all existing # tests passing as the default is 0, i.e. always save.) if len(self._known_hash_changes) >= self._worth_saving_limit: return True return False def _set_data(self, parent_ids, dirblocks): """Set the full dirstate data in memory. This is an internal function used to completely replace the objects in memory state. It puts the dirstate into state 'full-dirty'. :param parent_ids: A list of parent tree revision ids. :param dirblocks: A list containing one tuple for each directory in the tree. Each tuple contains the directory path and a list of entries found in that directory. """ # our memory copy is now authoritative. self._dirblocks = dirblocks self._mark_modified(header_modified=True) self._parents = list(parent_ids) self._id_index = None self._packed_stat_index = None def set_path_id(self, path, new_id): """Change the id of path to new_id in the current working tree. :param path: The path inside the tree to set - b'' is the root, 'foo' is the path foo in the root. :param new_id: The new id to assign to the path. This must be a utf8 file id (not unicode, and not None). 
""" self._read_dirblocks_if_needed() if len(path): # TODO: logic not written raise NotImplementedError(self.set_path_id) # TODO: check new id is unique entry = self._get_entry(0, path_utf8=path) if entry[0][2] == new_id: # Nothing to change. return if not isinstance(new_id, bytes): raise AssertionError(f"must be a utf8 file_id not {type(new_id)}") # mark the old path absent, and insert a new root path self._make_absent(entry) self.update_minimal( (b"", b"", new_id), b"d", path_utf8=b"", packed_stat=entry[1][0][4] ) self._mark_modified() def set_parent_trees(self, trees, ghosts): """Set the parent trees for the dirstate. :param trees: A list of revision_id, tree tuples. tree must be provided even if the revision_id refers to a ghost: supply an empty tree in this case. :param ghosts: A list of the revision_ids that are ghosts at the time of setting. """ # TODO: generate a list of parent indexes to preserve to save # processing specific parent trees. In the common case one tree will # be preserved - the left most parent. # TODO: if the parent tree is a dirstate, we might want to walk them # all by path in parallel for 'optimal' common-case performance. # generate new root row. self._read_dirblocks_if_needed() # TODO future sketch: Examine the existing parents to generate a change # map and then walk the new parent trees only, mapping them into the # dirstate. Walk the dirstate at the same time to remove unreferenced # entries. # for now: # sketch: loop over all entries in the dirstate, cherry picking # entries from the parent trees, if they are not ghost trees. # after we finish walking the dirstate, all entries not in the dirstate # are deletes, so we want to append them to the end as per the design # discussions. So do a set difference on ids with the parents to # get deletes, and add them to the end. # During the update process we need to answer the following questions: # - find other keys containing a fileid in order to create cross-path # links. 
We don't trivially use the inventory from other trees # because this leads to either double touching, or to accessing # missing keys, # - find other keys containing a path # We accumulate each entry via this dictionary, including the root by_path = {} id_index = IdIndex() # we could do parallel iterators, but because file id data may be # scattered throughout, we don't save on index overhead: we have to look # at everything anyway. We can probably save cycles by reusing parent # data and doing an incremental update when adding an additional # parent, but for now the common cases are adding a new parent (merge), # and replacing completely (commit), and commit is more common: so # optimise merge later. # ---- start generation of full tree mapping data # what trees should we use? parent_trees = [tree for rev_id, tree in trees if rev_id not in ghosts] # how many trees do we end up with parent_count = len(parent_trees) # one: the current tree for entry in self._iter_entries(): # skip entries not in the current tree if entry[1][0][0] in (b"a", b"r"): # absent, relocated continue by_path[entry[0]] = [entry[1][0]] + [ DirState.NULL_PARENT_DETAILS ] * parent_count # TODO: Possibly inline this, since we know it isn't present yet # id_index[entry[0][2]] = (entry[0],) id_index.add(entry[0]) # now the parent trees: for tree_index, tree in enumerate(parent_trees): # the index is off by one, adjust it. tree_index = tree_index + 1 # when we add new locations for a fileid we need these ranges for # any fileid in this tree as we set the by_path[id] to: # already_processed_tree_details + new_details + new_location_suffix # the suffix is from tree_index+1:parent_count+1. new_location_suffix = [DirState.NULL_PARENT_DETAILS] * ( parent_count - tree_index ) # now stitch in all the entries from this tree last_dirname = None for path, entry in tree.iter_entries_by_dir(): # here we process each tree's details for each item in the tree.
# we first update any existing entries for the id at other paths, # then we either create or update the entry for the id at the # right path, and finally we add (if needed) a mapping from # file_id to this path. We do it in this order to allow us to # avoid checking all known paths for the id when generating a # new entry at this path: by adding the id->path mapping last, # all the mappings are valid and have correct relocation # records where needed. file_id = entry.file_id path_utf8 = path.encode("utf8") dirname, basename = osutils.split(path_utf8) if dirname == last_dirname: # Try to re-use objects as much as possible dirname = last_dirname else: last_dirname = dirname new_entry_key = (dirname, basename, file_id) # tree index consistency: All other paths for this id in this tree # index must point to the correct path. entry_keys = id_index.get(file_id) for entry_key in entry_keys: # TODO:PROFILING: It might be faster to just update # rather than checking if we need to, and then overwrite # the one we are located at. if entry_key != new_entry_key: # this file id is at a different path in one of the # other trees, so put absent pointers there # This is the vertical axis in the matrix, all pointing # to the real path. by_path[entry_key][tree_index] = ( b"r", path_utf8, 0, False, b"", ) # by path consistency: Insert into an existing path record # (trivial), or add a new one with relocation pointers for the # other tree indexes. if new_entry_key in entry_keys: # there is already an entry where this data belongs, just # insert it. by_path[new_entry_key][tree_index] = _inv_entry_to_details(entry) else: # add relocated entries to the horizontal axis - this row # mapping from path,id. We need to look up the correct path # for the indexes from 0 to tree_index -1 new_details = [] for lookup_index in range(tree_index): # boundary case: this is the first occurrence of file_id # so there are no id_indexes, possibly take this out of # the loop?
if not len(entry_keys): new_details.append(DirState.NULL_PARENT_DETAILS) else: # grab any one entry, use it to find the right path. a_key = next(iter(entry_keys)) if by_path[a_key][lookup_index][0] in (b"r", b"a"): # it's a pointer or missing statement, use it as # is. new_details.append(by_path[a_key][lookup_index]) else: # we have the right key, make a pointer to it. real_path = (b"/".join(a_key[0:2])).strip(b"/") new_details.append((b"r", real_path, 0, False, b"")) new_details.append(_inv_entry_to_details(entry)) new_details.extend(new_location_suffix) by_path[new_entry_key] = new_details id_index.add(new_entry_key) # --- end generation of full tree mappings # sort and output all the entries new_entries = self._sort_entries(by_path.items()) self._entries_to_current_state(new_entries) self._parents = [rev_id for rev_id, tree in trees] self._ghosts = list(ghosts) self._mark_modified(header_modified=True) self._id_index = id_index @staticmethod def _sort_entries(entry_list): """Given a list of entries, sort them into the right order. This is done when constructing a new dirstate from trees - normally we try to keep everything in sorted blocks all the time, but sometimes it's easier to sort after the fact. """ # When sorting, we usually have 10x more entries than directories. (69k # total entries, 4k directories). So cache the results of splitting. # Saving time and objects. split_dirs = {} def _key(entry, _split_dirs=split_dirs): # sort by: directory parts, file name, file id dirpath, fname, file_id = entry[0] try: split = _split_dirs[dirpath] except KeyError: split = tuple(dirpath.split(b"/")) _split_dirs[dirpath] = split return (split, fname, file_id) return sorted(entry_list, key=_key) def set_state_from_inventory(self, new_inv): """Set new_inv as the current state. This API is called by tree transform, and will usually occur with existing parent trees. :param new_inv: The inventory object to set current state from.
""" evil_logger.debug( "set_state_from_inventory called; please mutate the tree instead" ) tracing = logger.isEnabledFor(logging.DEBUG) if tracing: logger.debug("set_state_from_inventory trace:") self._read_dirblocks_if_needed() # sketch: # Two iterators: current data and new data, both in dirblock order. # We zip them together, which tells about entries that are new in the # inventory, or removed in the inventory, or present in both and # possibly changed. # # You might think we could just synthesize a new dirstate directly # since we're processing it in the right order. However, we need to # also consider there may be any number of parent trees and relocation # pointers, and we don't want to duplicate that here. new_iterator = new_inv.iter_entries_by_dir() # we will be modifying the dirstate, so we need a stable iterator. In # future we might write one, for now we just clone the state into a # list using a copy so that we see every original item and don't have # to adjust the position when items are inserted or deleted in the # underlying dirstate. 
old_iterator = iter(list(self._iter_entries())) # both must have roots so this is safe: current_new = next(new_iterator) current_old = next(old_iterator) def advance(iterator): try: return next(iterator) except StopIteration: return None while current_new or current_old: # skip entries in old that are not really there if current_old and current_old[1][0][0] in (b"a", b"r"): # relocated or absent current_old = advance(old_iterator) continue if current_new: # convert new into dirblock style new_path_utf8 = current_new[0].encode("utf8") new_dirname, new_basename = osutils.split(new_path_utf8) new_id = current_new[1].file_id new_entry_key = (new_dirname, new_basename, new_id) current_new_minikind = DirState._kind_to_minikind[current_new[1].kind] if current_new_minikind == b"t": fingerprint = current_new[1].reference_revision or b"" else: # We normally only insert or remove records, or update # them when it has significantly changed. Then we want to # erase its fingerprint. Unaffected records should # normally not be updated at all. fingerprint = b"" else: # for safety disable variables new_path_utf8 = new_dirname = new_basename = new_id = new_entry_key = ( None ) # 5 cases, we don't have a value that is strictly greater than everything, so # we make both end conditions explicit if not current_old: # old is finished: insert current_new into the state.
if tracing: logger.debug( "Appending from new '%s'.", new_path_utf8.decode("utf8") ) self.update_minimal( new_entry_key, current_new_minikind, executable=current_new[1].executable, path_utf8=new_path_utf8, fingerprint=fingerprint, fullscan=True, ) current_new = advance(new_iterator) elif not current_new: # new is finished if tracing: logger.debug( "Truncating from old '%s/%s'.", current_old[0][0].decode("utf8"), current_old[0][1].decode("utf8"), ) self._make_absent(current_old) current_old = advance(old_iterator) elif new_entry_key == current_old[0]: # same - common case # We're looking at the same path and id in both the dirstate # and inventory, so just need to update the fields in the # dirstate from the one in the inventory. # TODO: update the record if anything significant has changed. # the minimal required trigger is if the execute bit or cached # kind has changed. if ( current_old[1][0][3] != current_new[1].executable or current_old[1][0][0] != current_new_minikind ): if tracing: logger.debug( "Updating in-place change '%s'.", new_path_utf8.decode("utf8"), ) self.update_minimal( current_old[0], current_new_minikind, executable=current_new[1].executable, path_utf8=new_path_utf8, fingerprint=fingerprint, fullscan=True, ) # both sides are dealt with, move on current_old = advance(old_iterator) current_new = advance(new_iterator) elif lt_by_dirs(new_dirname, current_old[0][0]) or ( new_dirname == current_old[0][0] and new_entry_key[1:] < current_old[0][1:] ): # new comes before: # add an entry for this and advance new if tracing: logger.debug( "Inserting from new '%s'.", new_path_utf8.decode("utf8") ) self.update_minimal( new_entry_key, current_new_minikind, executable=current_new[1].executable, path_utf8=new_path_utf8, fingerprint=fingerprint, fullscan=True, ) current_new = advance(new_iterator) else: # we've advanced past the place where the old key would be, # without seeing it in the new list. so it must be gone.
if tracing: logger.debug( "Deleting from old '%s/%s'.", current_old[0][0].decode("utf8"), current_old[0][1].decode("utf8"), ) self._make_absent(current_old) current_old = advance(old_iterator) self._mark_modified() self._id_index = None self._packed_stat_index = None if tracing: logger.debug("set_state_from_inventory complete.") def set_state_from_scratch(self, working_inv, parent_trees, parent_ghosts): """Wipe the currently stored state and set it to something new. This is a hard-reset for the data we are working with. """ # Technically, we really want a write lock, but until we write, we # don't really need it. self._requires_lock() # root dir and root dir contents with no children. We have to have a # root for set_state_from_inventory to work correctly. empty_root = ( (b"", b"", inventory.ROOT_ID), [(b"d", b"", 0, False, DirState.NULLSTAT)], ) empty_tree_dirblocks = [(b"", [empty_root]), (b"", [])] self._set_data([], empty_tree_dirblocks) self.set_state_from_inventory(working_inv) self.set_parent_trees(parent_trees, parent_ghosts) def _make_absent(self, current_old): """Mark current_old - an entry - as absent for tree 0. :return: True if this was the last details entry for the entry key: that is, if the underlying block has had the entry removed, thus shrinking in length. """ # build up paths that this id will be left at after the change is made, # so we can update their cross references in tree 0 all_remaining_keys = set() # Don't check the working tree, because it's going. for details in current_old[1][1:]: if details[0] not in (b"a", b"r"): # absent, relocated all_remaining_keys.add(current_old[0]) elif details[0] == b"r": # relocated # record the key for the real path. all_remaining_keys.add( tuple(osutils.split(details[1])) + (current_old[0][2],) ) # absent rows are not present at any path.
last_reference = current_old[0] not in all_remaining_keys if last_reference: # the current row consists entirely of the current item (being marked # absent), and relocated or absent entries for the other trees: # Remove it, it's meaningless. block = self._find_block(current_old[0]) entry_index, present = self._find_entry_index(current_old[0], block[1]) if not present: raise AssertionError(f"could not find entry for {current_old}") block[1].pop(entry_index) # if we have an id_index in use, remove this key from it for this id. if self._id_index is not None: self._id_index.remove(current_old[0]) # update all remaining keys for this id to record it as absent. The # existing details may either be the record we are marking as deleted # (if there were other trees with the id present at this path), or may # be relocations. for update_key in all_remaining_keys: update_block_index, present = self._find_block_index_from_key(update_key) if not present: raise AssertionError(f"could not find block for {update_key}") update_entry_index, present = self._find_entry_index( update_key, self._dirblocks[update_block_index][1] ) if not present: raise AssertionError(f"could not find entry for {update_key}") update_tree_details = self._dirblocks[update_block_index][1][ update_entry_index ][1] # it must not be absent at the moment if update_tree_details[0][0] == b"a": # absent raise AssertionError(f"bad row {update_tree_details!r}") update_tree_details[0] = DirState.NULL_PARENT_DETAILS self._mark_modified() return last_reference def update_minimal( self, key, minikind, executable=False, fingerprint=b"", packed_stat=None, size=0, path_utf8=None, fullscan=False, ): """Update an entry to the state in tree 0. This will either create a new entry at 'key' or update an existing one. It also makes sure that any other records which might mention this are updated as well.
:param key: (dir, name, file_id) for the new entry :param minikind: The type for the entry (b'f' == 'file', b'd' == 'directory'), etc. :param executable: Should the executable bit be set? :param fingerprint: Simple fingerprint for new entry: canonical-form sha1 for files, referenced revision id for subtrees, etc. :param packed_stat: Packed stat value for new entry. :param size: Size information for new entry :param path_utf8: key[0] + '/' + key[1], just passed in to avoid doing extra computation. :param fullscan: If True then a complete scan of the dirstate is being done and checking for duplicate rows should not be done. This should only be set by set_state_from_inventory and similar methods. If packed_stat and fingerprint are not given, they're invalidated in the entry. """ block = self._find_block(key)[1] if packed_stat is None: packed_stat = DirState.NULLSTAT # XXX: Some callers pass b'' as the packed_stat, and it seems to be # sometimes present in the dirstate - this seems oddly inconsistent. # mbp 20071008 entry_index, present = self._find_entry_index(key, block) new_details = (minikind, fingerprint, size, executable, packed_stat) id_index = self._get_id_index() if not present: # New record. Check there isn't an entry at this path already. if not fullscan: low_index, _ = self._find_entry_index(key[0:2] + (b"",), block) while low_index < len(block): entry = block[low_index] if entry[0][0:2] == key[0:2]: if entry[1][0][0] not in (b"a", b"r"): # This entry has the same path (but a different id) as # the new entry we're adding, and is present in this # tree.
self._raise_invalid( (b"%s/%s" % key[0:2]).decode("utf8"), key[2], "Attempt to add item at path already occupied by " "id {!r}".format(entry[0][2]), ) low_index += 1 else: break # new entry, synthesise cross references here, existing_keys = id_index.get(key[2]) if not existing_keys: # not currently in the state, simplest case new_entry = key, [new_details] + self._empty_parent_info() else: # present at one or more existing other paths. # grab one of them and use it to generate parent # relocation/absent entries. new_entry = key, [new_details] # existing_keys can be changed as we iterate. for other_key in tuple(existing_keys): # change the record at other to be a pointer to this new # record. The loop looks similar to the change to # relocations when updating an existing record but it's not: # the test for existing kinds is different: this can be # factored out to a helper though. other_block_index, present = self._find_block_index_from_key( other_key ) if not present: raise AssertionError(f"could not find block for {other_key}") other_block = self._dirblocks[other_block_index][1] other_entry_index, present = self._find_entry_index( other_key, other_block ) if not present: raise AssertionError( f"update_minimal: could not find other entry for {other_key}" ) if path_utf8 is None: raise AssertionError("no path") # Turn this other location into a reference to the new # location. This also updates the aliased iterator # (current_old in set_state_from_inventory) so that the old # entry, if not already examined, is skipped over by that # loop.
other_entry = other_block[other_entry_index] other_entry[1][0] = (b"r", path_utf8, 0, False, b"") if self._maybe_remove_row(other_block, other_entry_index, id_index): # If the row holding this was removed, we need to # recompute where this entry goes entry_index, _ = self._find_entry_index(key, block) # This loop: # adds a tuple to the new details for each column # - either by copying an existing relocation pointer inside that column # - or by creating a new pointer to the right row inside that column num_present_parents = self._num_present_parents() if num_present_parents: # TODO: This re-evaluates the existing_keys set, do we need # to do that ourselves? other_key = list(existing_keys)[0] for lookup_index in range(1, num_present_parents + 1): # grab any one entry, use it to find the right path. # TODO: optimise this to reduce memory use in highly # fragmented situations by reusing the relocation # records. update_block_index, present = self._find_block_index_from_key( other_key ) if not present: raise AssertionError(f"could not find block for {other_key}") update_entry_index, present = self._find_entry_index( other_key, self._dirblocks[update_block_index][1] ) if not present: raise AssertionError( f"update_minimal: could not find entry for {other_key}" ) update_details = self._dirblocks[update_block_index][1][ update_entry_index ][1][lookup_index] if update_details[0] in (b"a", b"r"): # relocated, absent # it's a pointer or absent in lookup_index's tree, use # it as is. new_entry[1].append(update_details) else: # we have the right key, make a pointer to it. pointer_path = osutils.pathjoin(*other_key[0:2]) new_entry[1].append((b"r", pointer_path, 0, False, b"")) block.insert(entry_index, new_entry) id_index.add(key) else: # Does the new state matter? block[entry_index][1][0] = new_details # parents cannot be affected by what we do. # other occurrences of this id can be found # from the id index.
# --- # tree index consistency: All other paths for this id in this tree # index must point to the correct path. We have to loop here because # we may have passed entries in the state with this file id already # that were absent - where parent entries are - and they need to be # converted to relocated. if path_utf8 is None: raise AssertionError("no path") existing_keys = id_index.get(key[2]) if key not in existing_keys: raise AssertionError( "We found the entry in the blocks, but" " the key is not in the id_index." f" key: {key}, existing_keys: {existing_keys}" ) for entry_key in existing_keys: # TODO:PROFILING: It might be faster to just update # rather than checking if we need to, and then overwrite # the one we are located at. if entry_key != key: # this file id is at a different path in one of the # other trees, so put absent pointers there # This is the vertical axis in the matrix, all pointing # to the real path. block_index, present = self._find_block_index_from_key(entry_key) if not present: raise AssertionError(f"not present: {entry_key!r}") entry_index, present = self._find_entry_index( entry_key, self._dirblocks[block_index][1] ) if not present: raise AssertionError(f"not present: {entry_key!r}") self._dirblocks[block_index][1][entry_index][1][0] = ( b"r", path_utf8, 0, False, b"", ) # add a containing dirblock if needed. if new_details[0] == b"d": # GZ 2017-06-09: Using pathjoin why? subdir_key = (osutils.pathjoin(*key[0:2]), b"", b"") block_index, present = self._find_block_index_from_key(subdir_key) if not present: self._dirblocks.insert(block_index, (subdir_key[0], [])) self._mark_modified() def _maybe_remove_row(self, block, index, id_index): """Remove index if it is absent or relocated across the row. id_index is updated accordingly.
:return: True if we removed the row, False otherwise """ present_in_row = False entry = block[index] for column in entry[1]: if column[0] not in (b"a", b"r"): present_in_row = True break if not present_in_row: block.pop(index) id_index.remove(entry[0]) return True return False def _validate(self): """Check that invariants on the dirblock are correct. This can be useful in debugging; it shouldn't be necessary in normal code. This must be called with a lock held. """ # NOTE: This must always raise AssertionError not just assert, # otherwise it may not behave properly under python -O # # TODO: All entries must have some content that's not b'a' or b'r', # otherwise it could just be removed. # # TODO: All relocations must point directly to a real entry. # # TODO: No repeated keys. # # -- mbp 20070325 from pprint import pformat self._read_dirblocks_if_needed() if len(self._dirblocks) > 0 and not self._dirblocks[0][0] == b"": raise AssertionError( "dirblocks don't start with root block:\n" + pformat(self._dirblocks) ) if len(self._dirblocks) > 1 and not self._dirblocks[1][0] == b"": raise AssertionError( "dirblocks missing root directory:\n" + pformat(self._dirblocks) ) # the dirblocks are sorted by their path components, name, and dir id dir_names = [d[0].split(b"/") for d in self._dirblocks[1:]] if dir_names != sorted(dir_names): raise AssertionError( "dir names are not in sorted order:\n" + pformat(self._dirblocks) + "\nkeys:\n" + pformat(dir_names) ) for dirblock in self._dirblocks: # within each dirblock, the entries are sorted by filename and # then by id. for entry in dirblock[1]: if dirblock[0] != entry[0][0]: raise AssertionError( f"entry key for {entry!r} " f"doesn't match directory name in\n{pformat(dirblock)!r}" ) if dirblock[1] != sorted(dirblock[1]): raise AssertionError( f"dirblock for {dirblock[0]!r} is not sorted:\n{pformat(dirblock)}" ) def check_valid_parent(): """Check that the current entry has a valid parent.
This makes sure that the parent has a record, and that the parent isn't marked as "absent" in the current tree. (It is invalid to have a non-absent file in an absent directory.) """ if entry[0][0:2] == (b"", b""): # There should be no parent for the root row return parent_entry = self._get_entry(tree_index, path_utf8=entry[0][0]) if parent_entry == (None, None): raise AssertionError( f"no parent entry for: {this_path} in tree {tree_index}" ) if parent_entry[1][tree_index][0] != b"d": raise AssertionError( f"Parent entry for {this_path} is not marked as a valid" f" directory. {parent_entry}" ) # For each file id, for each tree: either # the file id is not present at all; all rows with that id in the # key have it marked as 'absent' # OR the file id is present under exactly one name; any other entries # that mention that id point to the correct name. # # We check this with a dict per tree pointing either to the present # name, or None if absent. tree_count = self._num_present_parents() + 1 id_path_maps = [{} for _ in range(tree_count)] # Make sure that all renamed entries point to the correct location. for entry in self._iter_entries(): file_id = entry[0][2] this_path = osutils.pathjoin(entry[0][0], entry[0][1]) if len(entry[1]) != tree_count: raise AssertionError( "wrong number of entry details for row\n%s" ",\nexpected %d" % (pformat(entry), tree_count) ) absent_positions = 0 for tree_index, tree_state in enumerate(entry[1]): this_tree_map = id_path_maps[tree_index] minikind = tree_state[0] if minikind in (b"a", b"r"): absent_positions += 1 # have we seen this id before in this column? 
if file_id in this_tree_map: previous_path, previous_loc = this_tree_map[file_id] # any later mention of this file must be consistent with # what was said before if minikind == b"a": if previous_path is not None: raise AssertionError( "file {} is absent in row {!r} but also present " "at {!r}".format( file_id.decode("utf-8"), entry, previous_path ) ) elif minikind == b"r": target_location = tree_state[1] if previous_path != target_location: raise AssertionError( f"file {file_id} relocation in row {entry!r} but also at {previous_path!r}" ) else: # a file, directory, etc - may have been previously # pointed to by a relocation, which must point here if previous_path != this_path: raise AssertionError( "entry {!r} inconsistent with previous path {!r} " "seen at {!r}".format( entry, previous_path, previous_loc ) ) check_valid_parent() else: if minikind == b"a": # absent; should not occur anywhere else this_tree_map[file_id] = None, this_path elif minikind == b"r": # relocation, must occur at expected location this_tree_map[file_id] = tree_state[1], this_path else: this_tree_map[file_id] = this_path, this_path check_valid_parent() if absent_positions == tree_count: raise AssertionError(f"entry {entry!r} has no data for any tree.") if self._id_index is not None: for entry_key in self._id_index.iter_all(): # And that from this entry key, we can look up the original # record block_index, present = self._find_block_index_from_key(entry_key) if not present: raise AssertionError(f"missing block for entry key: {entry_key!r}") _entry_index, present = self._find_entry_index( entry_key, self._dirblocks[block_index][1] ) if not present: raise AssertionError(f"missing entry for key: {entry_key!r}") def _wipe_state(self): """Forget all state information about the dirstate.""" self._header_state = DirState.NOT_IN_MEMORY self._dirblock_state = DirState.NOT_IN_MEMORY self._changes_aborted = False self._parents = [] self._ghosts = [] self._dirblocks = [] self._id_index = None
self._packed_stat_index = None self._end_of_header = None self._cutoff_time = None self._split_path_cache = {} def lock_read(self): """Acquire a read lock on the dirstate.""" if self._lock_token is not None: raise LockContention(self._lock_token) # TODO: jam 20070301 Rather than wiping completely, if the blocks are # already in memory, we could read just the header and check for # any modification. If not modified, we can just leave things # alone self._lock_token = lock.ReadLock(self._filename) self._lock_state = "r" self._state_file = self._lock_token.f self._wipe_state() return lock.LogicalLockResult(self.unlock) def lock_write(self): """Acquire a write lock on the dirstate.""" if self._lock_token is not None: raise LockContention(self._lock_token) # TODO: jam 20070301 Rather than wiping completely, if the blocks are # already in memory, we could read just the header and check for # any modification. If not modified, we can just leave things # alone self._lock_token = lock.WriteLock(self._filename) self._lock_state = "w" self._state_file = self._lock_token.f self._wipe_state() return lock.LogicalLockResult(self.unlock, self._lock_token) def unlock(self): """Drop any locks held on the dirstate.""" if self._lock_token is None: raise LockNotHeld(self) # TODO: jam 20070301 Rather than wiping completely, if the blocks are # already in memory, we could read just the header and check for # any modification. If not modified, we can just leave things # alone self._state_file = None self._lock_state = None self._lock_token.unlock() self._lock_token = None self._split_path_cache = {} def _requires_lock(self): """Check that a lock is currently held by someone on the dirstate.""" if not self._lock_token: raise ObjectNotLocked(self) def py_update_entry( state, entry, abspath, stat_value, _stat_to_minikind=DirState._stat_to_minikind ): """Update the entry based on what is actually on disk. 
This function only calculates the sha if it needs to - if the entry is uncachable, or clearly different to the first parent's entry, no sha is calculated, and None is returned. :param state: The dirstate this entry is in. :param entry: This is the dirblock entry for the file in question. :param abspath: The path on disk for this file. :param stat_value: The stat value done on the path. :return: None, or The sha1 hexdigest of the file (40 bytes) or link target of a symlink. """ try: minikind = _stat_to_minikind[stat_value.st_mode & 0o170000] except KeyError: # Unhandled kind return None packed_stat = pack_stat(stat_value) ( saved_minikind, saved_link_or_sha1, saved_file_size, saved_executable, saved_packed_stat, ) = entry[1][0] if not isinstance(saved_minikind, bytes): raise TypeError(saved_minikind) if minikind == b"d" and saved_minikind == b"t": minikind = b"t" if minikind == saved_minikind and packed_stat == saved_packed_stat: # The stat hasn't changed since we saved, so we can re-use the # saved sha hash. if minikind == b"d": return None # size should also be in packed_stat if saved_file_size == stat_value.st_size: return saved_link_or_sha1 # If we have gotten this far, that means that we need to actually # process this entry. link_or_sha1 = None worth_saving = True if minikind == b"f": executable = state._is_executable(stat_value.st_mode, saved_executable) if state._cutoff_time is None: state._sha_cutoff_time() if ( stat_value.st_mtime < state._cutoff_time and stat_value.st_ctime < state._cutoff_time and len(entry[1]) > 1 and entry[1][1][0] != b"a" ): # Could check for size changes for further optimised # avoidance of sha1's. However the most prominent case of # over-shaing is during initial add, which this catches. # Besides, if content filtering happens, size and sha # are calculated at the same time, so checking just the size # gains nothing w.r.t. performance. 
link_or_sha1 = state._sha1_file(abspath) entry[1][0] = ( b"f", link_or_sha1, stat_value.st_size, executable, packed_stat, ) else: entry[1][0] = (b"f", b"", stat_value.st_size, executable, DirState.NULLSTAT) worth_saving = False elif minikind == b"d": link_or_sha1 = None entry[1][0] = (b"d", b"", 0, False, packed_stat) if saved_minikind != b"d": # This changed from something into a directory. Make sure we # have a directory block for it. This doesn't happen very # often, so this doesn't have to be super fast. ( block_index, entry_index, _dir_present, _file_present, ) = state._get_block_entry_index(entry[0][0], entry[0][1], 0) state._ensure_block( block_index, entry_index, osutils.pathjoin(entry[0][0], entry[0][1]) ) else: worth_saving = False elif minikind == b"l": if saved_minikind == b"l": worth_saving = False link_or_sha1 = state._read_link(abspath, saved_link_or_sha1) if state._cutoff_time is None: state._sha_cutoff_time() if ( stat_value.st_mtime < state._cutoff_time and stat_value.st_ctime < state._cutoff_time ): entry[1][0] = (b"l", link_or_sha1, stat_value.st_size, False, packed_stat) else: entry[1][0] = (b"l", b"", stat_value.st_size, False, DirState.NULLSTAT) if worth_saving: state._mark_modified([entry]) return link_or_sha1 class ProcessEntryPython: """Python implementation for processing directory state entries.""" __slots__ = [ "include_unchanged", "last_source_parent", "last_target_parent", "new_dirname_to_file_id", "old_dirname_to_file_id", "partial", "search_specific_file_parents", "search_specific_files", "searched_exact_paths", "searched_specific_files", "seen_ids", "source_index", "state", "target_index", "tree", "use_filesystem_for_exec", "utf8_decode", "want_unversioned", ] def __init__( self, include_unchanged, use_filesystem_for_exec, search_specific_files, state, source_index, target_index, want_unversioned, tree, ): """Initialize the ProcessEntryPython. Args: include_unchanged: Whether to include unchanged entries. 
use_filesystem_for_exec: Whether to use filesystem for executable checks. search_specific_files: Specific files to search for. state: The dirstate being processed. source_index: Index of the source tree. target_index: Index of the target tree. want_unversioned: Whether to include unversioned files. tree: The tree object. """ self.old_dirname_to_file_id = {} self.new_dirname_to_file_id = {} # Are we doing a partial iter_changes? self.partial = search_specific_files != {""} # Using a list so that we can access the values and change them in # nested scope. Each one is [path, file_id, entry] self.last_source_parent = [None, None] self.last_target_parent = [None, None] self.include_unchanged = include_unchanged self.use_filesystem_for_exec = use_filesystem_for_exec self.utf8_decode = codecs.utf_8_decode # for all search_indexs in each path at or under each element of # search_specific_files, if the detail is relocated: add the id, and # add the relocated path as one to search if its not searched already. # If the detail is not relocated, add the id. self.searched_specific_files = set() # When we search exact paths without expanding downwards, we record # that here. self.searched_exact_paths = set() self.search_specific_files = search_specific_files # The parents up to the root of the paths we are searching. # After all normal paths are returned, these specific items are returned. self.search_specific_file_parents = set() # The ids we've sent out in the delta. self.seen_ids = set() self.state = state self.source_index = source_index self.target_index = target_index if target_index != 0: # A lot of code in here depends on target_index == 0 raise BzrFormatsError("unsupported target index") self.want_unversioned = want_unversioned self.tree = tree def _process_entry(self, entry, path_info, pathjoin=osutils.pathjoin): """Compare an entry and real disk to generate delta information. :param path_info: top_relpath, basename, kind, lstat, abspath for the path of entry. 
If None, then the path is considered absent in the target (Perhaps we should pass in a concrete entry for this ?) Basename is returned as a utf8 string because we expect this tuple will be ignored, and don't want to take the time to decode. :return: (iter_changes_result, changed). If the entry has not been handled then changed is None. Otherwise it is False if no content or metadata changes have occurred, and True if any content or metadata change has occurred. If self.include_unchanged is True then if changed is not None, iter_changes_result will always be a result tuple. Otherwise, iter_changes_result is None unless changed is True. """ if self.source_index is None: source_details = DirState.NULL_PARENT_DETAILS else: source_details = entry[1][self.source_index] # GZ 2017-06-09: Eck, more sets. _fdltr = {b"f", b"d", b"l", b"t", b"r"} _fdlt = {b"f", b"d", b"l", b"t"} _ra = (b"r", b"a") target_details = entry[1][self.target_index] target_minikind = target_details[0] if path_info is not None and target_minikind in _fdlt: if not (self.target_index == 0): raise AssertionError() link_or_sha1 = update_entry( self.state, entry, abspath=path_info[4], stat_value=path_info[3] ) # The entry may have been modified by update_entry target_details = entry[1][self.target_index] target_minikind = target_details[0] else: link_or_sha1 = None file_id = entry[0][2] source_minikind = source_details[0] if source_minikind in _fdltr and target_minikind in _fdlt: # claimed content in both: diff # r | fdlt | | add source to search, add id path move and perform # | | | diff check on source-target # r | fdlt | a | dangling file that was present in the basis. # | | | ??? if source_minikind == b"r": # add the source to the search path to find any children it # has. TODO ? : only add if it is a container ? if not is_inside_any(self.searched_specific_files, source_details[1]): self.search_specific_files.add(source_details[1]) # generate the old path; this is needed for stating later # as well. 
old_path = source_details[1] old_dirname, old_basename = os.path.split(old_path) path = pathjoin(entry[0][0], entry[0][1]) old_entry = self.state._get_entry(self.source_index, path_utf8=old_path) # update the source details variable to be the real # location. if old_entry == (None, None): raise DirstateCorrupt( self.state._filename, "entry '{}/{}' is considered renamed from {!r}" " but source does not exist\n" "entry: {}".format(entry[0][0], entry[0][1], old_path, entry), ) source_details = old_entry[1][self.source_index] source_minikind = source_details[0] else: old_dirname = entry[0][0] old_basename = entry[0][1] old_path = path = None if path_info is None: # the file is missing on disk, show as removed. content_change = True target_kind = None target_exec = False else: # source and target are both versioned and disk file is present. target_kind = path_info[2] if target_kind == "directory": if path is None: old_path = path = pathjoin(old_dirname, old_basename) self.new_dirname_to_file_id[path] = file_id if source_minikind != b"d": content_change = True else: # directories have no fingerprint content_change = False target_exec = False elif target_kind == "file": if source_minikind != b"f": content_change = True else: # Check the sha. We can't just rely on the size as # content filtering may mean differ sizes actually # map to the same content if link_or_sha1 is None: # Stat cache miss: ( statvalue, link_or_sha1, ) = self.state._sha1_provider.stat_and_sha1(path_info[4]) self.state._observed_sha1(entry, link_or_sha1, statvalue) content_change = link_or_sha1 != source_details[1] # Target details is updated at update_entry time if self.use_filesystem_for_exec: # We don't need S_ISREG here, because we are sure # we are dealing with a file. 
target_exec = bool(stat.S_IEXEC & path_info[3].st_mode) else: target_exec = target_details[3] elif target_kind == "symlink": if source_minikind != b"l": content_change = True else: content_change = link_or_sha1 != source_details[1] target_exec = False elif target_kind == "tree-reference": content_change = source_minikind != b"t" target_exec = False else: if path is None: path = pathjoin(old_dirname, old_basename) raise BadFileKindError(path, path_info[2]) if source_minikind == b"d": if path is None: old_path = path = pathjoin(old_dirname, old_basename) self.old_dirname_to_file_id[old_path] = file_id # parent id is the entry for the path in the target tree if old_basename and old_dirname == self.last_source_parent[0]: source_parent_id = self.last_source_parent[1] else: try: source_parent_id = self.old_dirname_to_file_id[old_dirname] except KeyError: source_parent_entry = self.state._get_entry( self.source_index, path_utf8=old_dirname ) source_parent_id = source_parent_entry[0][2] if source_parent_id == entry[0][2]: # This is the root, so the parent is None source_parent_id = None else: self.last_source_parent[0] = old_dirname self.last_source_parent[1] = source_parent_id new_dirname = entry[0][0] if entry[0][1] and new_dirname == self.last_target_parent[0]: target_parent_id = self.last_target_parent[1] else: try: target_parent_id = self.new_dirname_to_file_id[new_dirname] except KeyError as e: # TODO: We don't always need to do the lookup, because the # parent entry will be the same as the source entry. 
target_parent_entry = self.state._get_entry( self.target_index, path_utf8=new_dirname ) if target_parent_entry == (None, None): raise AssertionError( f"Could not find target parent in wt: {new_dirname}\nparent of: {entry}" ) from e target_parent_id = target_parent_entry[0][2] if target_parent_id == entry[0][2]: # This is the root, so the parent is None target_parent_id = None else: self.last_target_parent[0] = new_dirname self.last_target_parent[1] = target_parent_id source_exec = source_details[3] changed = ( content_change or source_parent_id != target_parent_id or old_basename != entry[0][1] or source_exec != target_exec ) if not changed and not self.include_unchanged: return None, False else: if old_path is None: old_path = path = pathjoin(old_dirname, old_basename) old_path_u = self.utf8_decode(old_path, "surrogateescape")[0] path_u = old_path_u else: old_path_u = self.utf8_decode(old_path, "surrogateescape")[0] if old_path == path: path_u = old_path_u else: path_u = self.utf8_decode(path, "surrogateescape")[0] source_kind = DirState._minikind_to_kind[source_minikind] return DirstateInventoryChange( entry[0][2], (old_path_u, path_u), content_change, (True, True), (source_parent_id, target_parent_id), ( self.utf8_decode(old_basename, "surrogateescape")[0], self.utf8_decode(entry[0][1], "surrogateescape")[0], ), (source_kind, target_kind), (source_exec, target_exec), ), changed elif source_minikind in b"a" and target_minikind in _fdlt: # looks like a new file path = pathjoin(entry[0][0], entry[0][1]) # parent id is the entry for the path in the target tree # TODO: these are the same for an entire directory: cache em. parent_id = self.state._get_entry(self.target_index, path_utf8=entry[0][0])[ 0 ][2] if parent_id == entry[0][2]: parent_id = None if path_info is not None: # Present on disk: if self.use_filesystem_for_exec: # We need S_ISREG here, because we aren't sure if this # is a file or not. 
target_exec = bool( stat.S_ISREG(path_info[3].st_mode) and stat.S_IEXEC & path_info[3].st_mode ) else: target_exec = target_details[3] return DirstateInventoryChange( entry[0][2], (None, self.utf8_decode(path, "surrogateescape")[0]), True, (False, True), (None, parent_id), (None, self.utf8_decode(entry[0][1], "surrogateescape")[0]), (None, path_info[2]), (None, target_exec), ), True else: # Its a missing file, report it as such. return DirstateInventoryChange( entry[0][2], (None, self.utf8_decode(path, "surrogateescape")[0]), False, (False, True), (None, parent_id), (None, self.utf8_decode(entry[0][1], "surrogateescape")[0]), (None, None), (None, False), ), True elif source_minikind in _fdlt and target_minikind in b"a": # unversioned, possibly, or possibly not deleted: we dont care. # if its still on disk, *and* theres no other entry at this # path [we dont know this in this routine at the moment - # perhaps we should change this - then it would be an unknown. old_path = pathjoin(entry[0][0], entry[0][1]) # parent id is the entry for the path in the target tree parent_id = self.state._get_entry(self.source_index, path_utf8=entry[0][0])[ 0 ][2] if parent_id == entry[0][2]: parent_id = None return DirstateInventoryChange( entry[0][2], (self.utf8_decode(old_path, "surrogateescape")[0], None), True, (True, False), (parent_id, None), (self.utf8_decode(entry[0][1], "surrogateescape")[0], None), (DirState._minikind_to_kind[source_minikind], None), (source_details[3], None), ), True elif source_minikind in _fdlt and target_minikind in b"r": # a rename; could be a true rename, or a rename inherited from # a renamed parent. TODO: handle this efficiently. Its not # common case to rename dirs though, so a correct but slow # implementation will do. 
if not is_inside_any(self.searched_specific_files, target_details[1]): self.search_specific_files.add(target_details[1]) elif source_minikind in _ra and target_minikind in _ra: # neither of the selected trees contain this file, # so skip over it. This is not currently directly tested, but # is indirectly via test_too_much.TestCommands.test_conflicts. pass else: raise AssertionError( "don't know how to compare " f"source_minikind={source_minikind!r}, target_minikind={target_minikind!r}" ) return None, None def __iter__(self): """Return iterator for processing entries.""" return self def _gather_result_for_consistency(self, result): """Check a result we will yield to make sure we are consistent later. This gathers result's parents into a set to output later. :param result: A result tuple. """ if not self.partial or not result.file_id: return self.seen_ids.add(result.file_id) new_path = result.path[1] if new_path: # Not the root and not a delete: queue up the parents of the path. self.search_specific_file_parents.update( p.encode("utf8", "surrogateescape") for p in parent_directories(new_path) ) # Add the root directory which parent_directories does not # provide. self.search_specific_file_parents.add(b"") def iter_changes(self): """Iterate over the changes.""" utf8_decode = codecs.utf_8_decode _lt_by_dirs = lt_by_dirs _process_entry = self._process_entry search_specific_files = self.search_specific_files searched_specific_files = self.searched_specific_files splitpath = osutils.splitpath # sketch: # compare source_index and target_index at or under each element of search_specific_files. # follow the following comparison table. Note that we only want to do diff operations when # the target is fdl because thats when the walkdirs logic will have exposed the pathinfo # for the target. 
# cases: # # Source | Target | disk | action # r | fdlt | | add source to search, add id path move and perform # | | | diff check on source-target # r | fdlt | a | dangling file that was present in the basis. # | | | ??? # r | a | | add source to search # r | a | a | # r | r | | this path is present in a non-examined tree, skip. # r | r | a | this path is present in a non-examined tree, skip. # a | fdlt | | add new id # a | fdlt | a | dangling locally added file, skip # a | a | | not present in either tree, skip # a | a | a | not present in any tree, skip # a | r | | not present in either tree at this path, skip as it # | | | may not be selected by the users list of paths. # a | r | a | not present in either tree at this path, skip as it # | | | may not be selected by the users list of paths. # fdlt | fdlt | | content in both: diff them # fdlt | fdlt | a | deleted locally, but not unversioned - show as deleted ? # fdlt | a | | unversioned: output deleted id for now # fdlt | a | a | unversioned and deleted: output deleted id # fdlt | r | | relocated in this tree, so add target to search. # | | | Dont diff, we will see an r,fd; pair when we reach # | | | this id at the other path. # fdlt | r | a | relocated in this tree, so add target to search. # | | | Dont diff, we will see an r,fd; pair when we reach # | | | this id at the other path. # TODO: jam 20070516 - Avoid the _get_entry lookup overhead by # keeping a cache of directories that we have seen. while search_specific_files: # TODO: the pending list should be lexically sorted? the # interface doesn't require it. current_root = search_specific_files.pop() current_root_unicode = current_root.decode("utf8") searched_specific_files.add(current_root) # process the entries for this containing directory: the rest will be # found by their parents recursively. 
root_entries = self.state._entries_for_path(current_root) root_abspath = self.tree.abspath(current_root_unicode) try: root_stat = os.lstat(root_abspath) except FileNotFoundError: # the path does not exist: let _process_entry know that. root_dir_info = None else: root_dir_info = ( b"", current_root, osutils.file_kind_from_stat_mode(root_stat.st_mode), root_stat, root_abspath, ) if root_dir_info[2] == "directory": if self.tree._directory_is_tree_reference( current_root.decode("utf8") ): root_dir_info = ( root_dir_info[:2] + ("tree-reference",) + root_dir_info[3:] ) if not root_entries and not root_dir_info: # this specified path is not present at all, skip it. continue path_handled = False for entry in root_entries: result, changed = _process_entry(entry, root_dir_info) if changed is not None: path_handled = True if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: yield result if self.want_unversioned and not path_handled and root_dir_info: new_executable = bool( stat.S_ISREG(root_dir_info[3].st_mode) and stat.S_IEXEC & root_dir_info[3].st_mode ) yield DirstateInventoryChange( None, (None, current_root_unicode), True, (False, False), (None, None), (None, splitpath(current_root_unicode)[-1]), (None, root_dir_info[2]), (None, new_executable), ) initial_key = (current_root, b"", b"") block_index, _ = self.state._find_block_index_from_key(initial_key) if block_index == 0: # we have processed the total root already, but because the # initial key matched it we should skip it here. 
block_index += 1 if root_dir_info and root_dir_info[2] == "tree-reference": current_dir_info = None else: dir_iterator = _walkdirs_utf8(root_abspath, prefix=current_root) try: current_dir_info = next(dir_iterator) except (FileNotFoundError, NotADirectoryError, ValueError): current_dir_info = None else: if current_dir_info[0][0] == b"": # remove .bzr from iteration bzr_index = bisect.bisect_left(current_dir_info[1], (b".bzr",)) if current_dir_info[1][bzr_index][0] != b".bzr": raise AssertionError() del current_dir_info[1][bzr_index] # walk until both the directory listing and the versioned metadata # are exhausted. if block_index < len(self.state._dirblocks) and is_inside( current_root, self.state._dirblocks[block_index][0] ): current_block = self.state._dirblocks[block_index] else: current_block = None while current_dir_info is not None or current_block is not None: if ( current_dir_info and current_block and current_dir_info[0][0] != current_block[0] ): if _lt_by_dirs(current_dir_info[0][0], current_block[0]): # filesystem data refers to paths not covered by the dirblock. # this has two possibilities: # A) it is versioned but empty, so there is no block for it # B) it is not versioned. # if (A) then we need to recurse into it to check for # new unknown files or directories. # if (B) then we should ignore it, because we don't # recurse into unknown directories. 
path_index = 0 while path_index < len(current_dir_info[1]): current_path_info = current_dir_info[1][path_index] if self.want_unversioned: if current_path_info[2] == "directory": if self.tree._directory_is_tree_reference( current_path_info[0].decode("utf8") ): current_path_info = ( current_path_info[:2] + ("tree-reference",) + current_path_info[3:] ) new_executable = bool( stat.S_ISREG(current_path_info[3].st_mode) and stat.S_IEXEC & current_path_info[3].st_mode ) yield DirstateInventoryChange( None, ( None, utf8_decode( current_path_info[0], "surrogateescape" )[0], ), True, (False, False), (None, None), ( None, utf8_decode( current_path_info[1], "surrogateescape" )[0], ), (None, current_path_info[2]), (None, new_executable), ) # dont descend into this unversioned path if it is # a dir if current_path_info[2] in ("directory", "tree-reference"): del current_dir_info[1][path_index] path_index -= 1 path_index += 1 # This dir info has been handled, go to the next try: current_dir_info = next(dir_iterator) except StopIteration: current_dir_info = None else: # We have a dirblock entry for this location, but there # is no filesystem path for this. This is most likely # because a directory was removed from the disk. # We don't have to report the missing directory, # because that should have already been handled, but we # need to handle all of the files that are contained # within. for current_entry in current_block[1]: # entry referring to file not present on disk. # advance the entry only, after processing. 
result, changed = _process_entry(current_entry, None) if changed is not None: if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: yield result block_index += 1 if block_index < len(self.state._dirblocks) and is_inside( current_root, self.state._dirblocks[block_index][0] ): current_block = self.state._dirblocks[block_index] else: current_block = None continue entry_index = 0 if current_block and entry_index < len(current_block[1]): current_entry = current_block[1][entry_index] else: current_entry = None advance_entry = True path_index = 0 if current_dir_info and path_index < len(current_dir_info[1]): current_path_info = current_dir_info[1][path_index] if current_path_info[2] == "directory": if self.tree._directory_is_tree_reference( current_path_info[0].decode("utf8") ): current_path_info = ( current_path_info[:2] + ("tree-reference",) + current_path_info[3:] ) else: current_path_info = None advance_path = True path_handled = False while current_entry is not None or current_path_info is not None: if current_entry is None: # the check for path_handled when the path is advanced # will yield this path if needed. pass elif current_path_info is None: # no path is fine: the per entry code will handle it. result, changed = _process_entry( current_entry, current_path_info ) if changed is not None: if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: yield result elif current_entry[0][1] != current_path_info[1] or current_entry[ 1 ][self.target_index][0] in (b"a", b"r"): # The current path on disk doesn't match the dirblock # record. Either the dirblock is marked as absent, or # the file on disk is not present at all in the # dirblock. Either way, report about the dirblock # entry, and let other code handle the filesystem one. 
# Compare the basename for these files to determine # which comes first if current_path_info[1] < current_entry[0][1]: # extra file on disk: pass for now, but only # increment the path, not the entry advance_entry = False else: # entry referring to file not present on disk. # advance the entry only, after processing. result, changed = _process_entry(current_entry, None) if changed is not None: if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: yield result advance_path = False else: result, changed = _process_entry( current_entry, current_path_info ) if changed is not None: path_handled = True if changed: self._gather_result_for_consistency(result) if changed or self.include_unchanged: yield result if advance_entry and current_entry is not None: entry_index += 1 if entry_index < len(current_block[1]): current_entry = current_block[1][entry_index] else: current_entry = None else: advance_entry = True # reset the advance flag if advance_path and current_path_info is not None: if not path_handled: # unversioned in all regards if self.want_unversioned: new_executable = bool( stat.S_ISREG(current_path_info[3].st_mode) and stat.S_IEXEC & current_path_info[3].st_mode ) relpath_unicode = utf8_decode( current_path_info[0], "surrogateescape" )[0] yield DirstateInventoryChange( None, (None, relpath_unicode), True, (False, False), (None, None), ( None, utf8_decode( current_path_info[1], "surrogateescape" )[0], ), (None, current_path_info[2]), (None, new_executable), ) # don't descend into this unversioned path if it is # a dir if current_path_info[2] == "directory": del current_dir_info[1][path_index] path_index -= 1 # don't descend the disk iterator into any tree # paths.
if current_path_info[2] == "tree-reference": del current_dir_info[1][path_index] path_index -= 1 path_index += 1 if path_index < len(current_dir_info[1]): current_path_info = current_dir_info[1][path_index] if current_path_info[2] == "directory": if self.tree._directory_is_tree_reference( current_path_info[0].decode("utf8") ): current_path_info = ( current_path_info[:2] + ("tree-reference",) + current_path_info[3:] ) else: current_path_info = None path_handled = False else: advance_path = True # reset the advance flag. if current_block is not None: block_index += 1 if block_index < len(self.state._dirblocks) and is_inside( current_root, self.state._dirblocks[block_index][0] ): current_block = self.state._dirblocks[block_index] else: current_block = None if current_dir_info is not None: try: current_dir_info = next(dir_iterator) except StopIteration: current_dir_info = None for result in self._iter_specific_file_parents(): yield result def _iter_specific_file_parents(self): """Iter over the specific file parents.""" while self.search_specific_file_parents: # Process the parent directories for the paths we were iterating. # Even in extremely large trees this should be modest, so currently # no attempt is made to optimise. path_utf8 = self.search_specific_file_parents.pop() if is_inside_any(self.searched_specific_files, path_utf8): # We've examined this path. continue if path_utf8 in self.searched_exact_paths: # We've examined this path. continue path_entries = self.state._entries_for_path(path_utf8) # We need either one or two entries. If the path in # self.target_index has moved (so the entry in source_index is in # 'ar') then we need to also look for the entry for this path in # self.source_index, to output the appropriate delete-or-rename.
selected_entries = [] found_item = False for candidate_entry in path_entries: # Find entries present in target at this path: if candidate_entry[1][self.target_index][0] not in (b"a", b"r"): found_item = True selected_entries.append(candidate_entry) # Find entries present in source at this path: elif self.source_index is not None and candidate_entry[1][ self.source_index ][0] not in (b"a", b"r"): found_item = True if candidate_entry[1][self.target_index][0] == b"a": # Deleted, emit it here. selected_entries.append(candidate_entry) else: # renamed, emit it when we process the directory it # ended up at. self.search_specific_file_parents.add( candidate_entry[1][self.target_index][1] ) if not found_item: raise AssertionError( "Missing entry for specific path parent {!r}, {!r}".format( path_utf8, path_entries ) ) path_info = self._path_info(path_utf8, path_utf8.decode("utf8")) for entry in selected_entries: if entry[0][2] in self.seen_ids: continue result, changed = self._process_entry(entry, path_info) if changed is None: raise AssertionError( "Got entry<->path mismatch for specific path " f"{path_utf8!r} entry {entry!r} path_info {path_info!r} " ) # Only include changes - we're outside the users requested # expansion. if changed: self._gather_result_for_consistency(result) if result.kind[0] == "directory" and result.kind[1] != "directory": # This stopped being a directory, the old children have # to be included. if entry[1][self.source_index][0] == b"r": # renamed, take the source path entry_path_utf8 = entry[1][self.source_index][1] else: entry_path_utf8 = path_utf8 initial_key = (entry_path_utf8, b"", b"") block_index, _ = self.state._find_block_index_from_key( initial_key ) if block_index == 0: # The children of the root are in block index 1. block_index += 1 current_block = None if block_index < len(self.state._dirblocks): current_block = self.state._dirblocks[block_index] if not is_inside(entry_path_utf8, current_block[0]): # No entries for this directory at all. 
current_block = None if current_block is not None: for entry in current_block[1]: if entry[1][self.source_index][0] in (b"a", b"r"): # Not in the source tree, so doesn't have to be # included. continue # Path of the entry itself. self.search_specific_file_parents.add( osutils.pathjoin(*entry[0][:2]) ) if changed or self.include_unchanged: yield result self.searched_exact_paths.add(path_utf8) def _path_info(self, utf8_path, unicode_path): """Generate path_info for unicode_path. :return: None if unicode_path does not exist, or a path_info tuple. """ abspath = self.tree.abspath(unicode_path) try: stat = os.lstat(abspath) except FileNotFoundError: # the path does not exist. return None utf8_basename = utf8_path.rsplit(b"/", 1)[-1] dir_info = ( utf8_path, utf8_basename, osutils.file_kind_from_stat_mode(stat.st_mode), stat, abspath, ) if dir_info[2] == "directory": if self.tree._directory_is_tree_reference(unicode_path): # Rebuild the local tuple; root_dir_info is not an attribute # of this class (it is not listed in __slots__). dir_info = ( dir_info[:2] + ("tree-reference",) + dir_info[3:] ) return dir_info from ._bzr_rs import dirstate as _dirstate_rs DefaultSHA1Provider = _dirstate_rs.DefaultSHA1Provider bisect_dirblock = _dirstate_rs.bisect_dirblock bisect_path_left = _dirstate_rs.bisect_path_left bisect_path_right = _dirstate_rs.bisect_path_right lt_by_dirs = _dirstate_rs.lt_by_dirs lt_path_by_dirblock = _dirstate_rs.lt_path_by_dirblock pack_stat = _dirstate_rs.pack_stat _fields_per_entry = _dirstate_rs.fields_per_entry _get_ghosts_line = _dirstate_rs.get_ghosts_line _get_parents_line = _dirstate_rs.get_parents_line IdIndex = _dirstate_rs.IdIndex _inv_entry_to_details = _dirstate_rs.inv_entry_to_details _get_output_lines = _dirstate_rs.get_output_lines # Try to load the compiled form if possible try: from ._dirstate_helpers_pyx import ProcessEntryC as _process_entry # noqa: N813 from ._dirstate_helpers_pyx import _read_dirblocks from ._dirstate_helpers_pyx import update_entry as update_entry except ModuleNotFoundError as e:
osutils.failed_to_load_extension(e) from ._dirstate_helpers_py import _read_dirblocks # FIXME: It would be nice to be able to track moved lines so that the # corresponding python code can be moved to the _dirstate_helpers_py # module. I don't want to break the history for this important piece of # code so I left the code here -- vila 20090622 update_entry = py_update_entry _process_entry = ProcessEntryPython bzrformats_3.4.0.orig/bzrformats/errors.py0000644000000000000000000004612215162115107015741 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Errors specific to bzrformats.""" class BzrFormatsError(Exception): """Base class for errors raised by bzrformats. Attributes: internal_error: if True this was probably caused by a brz bug and should be displayed with a traceback; if False (or absent) this was probably a user or environment error and they don't need the gory details. (That can be overridden by -Derror on the command line.) _fmt: Format string to display the error; this is expanded by the instance's dict. """ internal_error = False def __init__(self, msg=None, **kwds): """Construct a new BzrFormatsError. There are two alternative forms for constructing these objects. Either a preformatted string may be passed, or a set of named arguments can be given. 
The first is for generic "user" errors which are not intended to be caught and so do not need a specific subclass. The second case is for use with subclasses that provide a _fmt format string to print the arguments. Keyword arguments are taken as parameters to the error, which can be inserted into the format string template. It's recommended that subclasses override the __init__ method to require specific parameters. Args: msg: If given, this is the literal complete text for the error, not subject to expansion. 'msg' is used instead of 'message' because python evolved and, in 2.6, forbids the use of 'message'. """ Exception.__init__(self) if msg is not None: # I was going to deprecate this, but it actually turns out to be # quite handy - mbp 20061103. self._preformatted_string = msg else: self._preformatted_string = None for key, value in kwds.items(): setattr(self, key, value) def _format(self): s = getattr(self, "_preformatted_string", None) if s is not None: # contains a preformatted message return s err = None try: fmt = self._get_format_string() if fmt: d = dict(self.__dict__) s = fmt % d # __str__() should always return a 'str' object # never a 'unicode' object. 
return s except Exception as e: err = e return "Unprintable exception {}: dict={!r}, fmt={!r}, error={!r}".format( self.__class__.__name__, self.__dict__, getattr(self, "_fmt", None), err ) __str__ = _format def __repr__(self): """Return a string representation of this error.""" return f"{self.__class__.__name__}({self!s})" def _get_format_string(self): """Return format string for this exception or None.""" return getattr(self, "_fmt", None) def __eq__(self, other): """Return True if this error equals other.""" if self.__class__ is not other.__class__: return NotImplemented return self.__dict__ == other.__dict__ def __hash__(self): """Return a hash based on object identity.""" return id(self) class UnexpectedInventoryFormat(BzrFormatsError): """Unexpected inventory format encountered.""" _fmt = "Unexpected inventory format: %(msg)s" def __init__(self, msg): """Initialize with the unexpected format message.""" super().__init__() self.msg = msg class UnsupportedInventoryKind(BzrFormatsError): """Unsupported inventory kind encountered.""" _fmt = "Unsupported inventory kind: %(kind)s" def __init__(self, kind): """Initialize with the unsupported kind.""" super().__init__() self.kind = kind class KnitCorrupt(BzrFormatsError): """A knit file is corrupt.""" _fmt = "Knit %(knit)s corrupt: %(how)s" def __init__(self, knit, how): """Initialize with the knit and corruption description.""" super().__init__() self.knit = knit self.how = how class KnitDataStreamIncompatible(BzrFormatsError): """Cannot insert knit data stream due to incompatibility.""" _fmt = "Cannot insert knit data stream for %(key)s: %(msg)s" def __init__(self, key, msg): """Initialize with the key and incompatibility message.""" super().__init__() self.key = key self.msg = msg class KnitDataStreamUnknown(BzrFormatsError): """Unknown knit data stream type.""" _fmt = "Unknown knit data stream for %(key)s" def __init__(self, key): """Initialize with the key of the unknown stream.""" super().__init__() self.key = 
key class KnitHeaderError(BzrFormatsError): """A knit file has an invalid header.""" _fmt = "Knit header error: %(badline)r" def __init__(self, badline): """Initialize with the bad header line.""" super().__init__() self.badline = badline class DirstateCorrupt(BzrFormatsError): """The dirstate file appears to be corrupt.""" _fmt = "The dirstate file (%(state)s) appears to be corrupt: %(msg)s" def __init__(self, state, msg): """Initialize with the state file path and corruption message.""" super().__init__() self.state = state self.msg = msg # Index errors class BadIndexFormatSignature(BzrFormatsError): """Value is not an index of the expected type.""" _fmt = "%(value)s is not an index of type %(_type)s." def __init__(self, value, _type): """Initialize.""" super().__init__() self.value = value self._type = _type class BadIndexData(BzrFormatsError): """Error in data for an index.""" _fmt = "Error in data for index %(value)s." def __init__(self, value): """Initialize.""" super().__init__() self.value = value class BadIndexDuplicateKey(BzrFormatsError): """A key is already present in the index.""" _fmt = "The key '%(key)s' is already in index '%(index)s'." def __init__(self, key, index): """Initialize.""" super().__init__() self.key = key self.index = index class BadIndexKey(BzrFormatsError): """A key is not valid for an index.""" _fmt = "The key '%(key)s' is not a valid key." def __init__(self, key): """Initialize.""" super().__init__() self.key = key class BadIndexOptions(BzrFormatsError): """Could not parse options for an index.""" _fmt = "Could not parse options for index %(value)s." def __init__(self, value): """Initialize.""" super().__init__() self.value = value class BadIndexValue(BzrFormatsError): """A value is not valid for an index.""" _fmt = "The value '%(value)s' is not a valid value." 
def __init__(self, value): """Initialize.""" super().__init__() self.value = value # Inventory errors class InvalidEntryName(BzrFormatsError): """Invalid entry name.""" _fmt = "Invalid entry name: %(name)s" def __init__(self, name): """Initialize.""" super().__init__() self.name = name class DuplicateFileId(BzrFormatsError): """File ID already exists in inventory.""" _fmt = "File id {%(file_id)s} already exists in inventory as %(entry)s" def __init__(self, file_id, entry): """Initialize.""" super().__init__() self.file_id = file_id self.entry = entry # Groupcompress errors class DecompressCorruption(BzrFormatsError): """Corruption while decompressing repository file.""" _fmt = "Corruption while decompressing repository file%(orig_error)s" def __init__(self, orig_error=""): """Initialize.""" if orig_error: self.orig_error = f", {orig_error}" else: self.orig_error = "" # Versioned file errors class VersionedFileError(BzrFormatsError): """Base class for versioned file errors. Raised when operations on versioned files encounter problems. """ _fmt = "Versioned file error" class RevisionNotPresent(VersionedFileError): """Revision not present in versioned file. Raised when attempting to access a revision that does not exist in the specified versioned file. """ _fmt = 'Revision {%(revision_id)s} not present in "%(file_id)s".' def __init__(self, revision_id, file_id): """Initialize with revision and file information. Args: revision_id: The revision ID that was not found. file_id: The file ID where the revision was not found. """ super().__init__() self.revision_id = revision_id self.file_id = file_id class RevisionAlreadyPresent(VersionedFileError): """Revision already present in versioned file. Raised when attempting to add a revision that already exists in the specified versioned file. """ _fmt = 'Revision {%(revision_id)s} already present in "%(file_id)s".' def __init__(self, revision_id, file_id): """Initialize with revision and file information. 
Args: revision_id: The revision ID that is already present. file_id: The file ID where the revision already exists. """ super().__init__() self.revision_id = revision_id self.file_id = file_id class InvalidRevisionId(BzrFormatsError): """Invalid revision ID specified. Raised when a revision ID is not valid or not found in the branch. """ _fmt = "Invalid revision-id {%(revision_id)s} in %(branch)s" def __init__(self, revision_id, branch): """Initialize with the invalid revision ID and branch. Args: revision_id: The invalid revision ID. branch: The branch where the revision ID was not found. """ super().__init__() self.revision_id = revision_id self.branch = branch class UnavailableRepresentation(BzrFormatsError): """Requested representation encoding is not available for a key.""" _fmt = ( "The encoding '%(wanted)s' is not available for key %(key)s which " "is encoded as '%(native)s'." ) def __init__(self, key, wanted, native): """Initialize.""" super().__init__() self.wanted = wanted self.native = native self.key = key class ExistingContent(BzrFormatsError): """The content being inserted is already present.""" _fmt = "The content being inserted is already present." 
# Weave errors


class WeaveError(BzrFormatsError):
    """Error in processing weave."""

    _fmt = "Error in processing weave"


class WeaveRevisionAlreadyPresent(WeaveError):
    """Revision already present in weave."""

    _fmt = "Revision {%(revision_id)s} already present in weave"

    def __init__(self, revision_id):
        """Initialize."""
        super().__init__()
        self.revision_id = revision_id


class WeaveRevisionNotPresent(WeaveError):
    """Revision not present in weave."""

    _fmt = "Revision {%(revision_id)s} not present in weave"

    def __init__(self, revision_id):
        """Initialize."""
        super().__init__()
        self.revision_id = revision_id


class WeaveFormatError(WeaveError):
    """Weave invariant violated."""

    _fmt = "Weave invariant violated: %(what)s"

    def __init__(self, what):
        """Initialize."""
        super().__init__()
        self.what = what


class WeaveParentMismatch(WeaveError):
    """Parents are mismatched between two revisions."""

    _fmt = "Parents are mismatched between two revisions. %(message)s"


class WeaveInvalidChecksum(WeaveError):
    """Text did not match its checksum in the weave."""

    _fmt = "Text did not match its checksum: %(message)s"


class WeaveTextDiffers(WeaveError):
    """Weaves differ on text content for a revision."""

    _fmt = (
        "Weaves differ on text content.
Revision:" " {%(revision_id)s}, %(weave_a)s, %(weave_b)s" ) def __init__(self, revision_id, weave_a, weave_b): """Initialize.""" super().__init__() self.revision_id = revision_id self.weave_a = weave_a self.weave_b = weave_b # Serializer errors class BadInventoryFormat(BzrFormatsError): """Inventory XML has an unexpected root tag.""" _fmt = "Root tag is %(tag)r" def __init__(self, tag): """Initialize.""" super().__init__() self.tag = tag class ReservedId(BzrFormatsError): """A revision ID that is reserved for internal use was encountered.""" _fmt = "Reserved revision-id {%(revision_id)s}" def __init__(self, revision_id): """Initialize.""" super().__init__() self.revision_id = revision_id class BadFileKindError(BzrFormatsError): """Cannot operate on file of unsupported kind. Raised when attempting to perform an operation on a file whose type (kind) is not supported by the current operation. """ _fmt = "Cannot operate on %(filename)s of unsupported kind %(kind)s" def __init__(self, filename, kind): """Create a BadFileKindError. Args: filename: Path to the file with unsupported kind. kind: The unsupported file kind. """ super().__init__() self.filename = filename self.kind = kind # Transport-related errors class PathError(BzrFormatsError): """Base class for path-related errors.""" _fmt = "Path error: %(path)r%(extra)s" def __init__(self, path, extra=None): """Initialize.""" super().__init__() self.path = path if extra: self.extra = ": " + str(extra) else: self.extra = "" class NoSuchFile(PathError): """Exception raised when a file or directory does not exist. This is the standard exception raised by transports when attempting to access a non-existent file or directory. """ _fmt = "No such file: %(path)r%(extra)s" class VersionedFileInvalidChecksum(VersionedFileError): """Text checksum validation failed. Raised when the checksum of text in a versioned file does not match the expected checksum, indicating data corruption. 
""" _fmt = "Text did not match its checksum: %(msg)s" class InconsistentDelta(BzrFormatsError): """Used when we get a delta that is not valid.""" _fmt = ( "An inconsistent delta was supplied involving %(path)r," " %(file_id)r\nreason: %(reason)s" ) def __init__(self, path, file_id, reason): """Initialize with delta inconsistency details. Args: path: The path involved in the inconsistent delta. file_id: The file ID involved in the inconsistent delta. reason: The reason why the delta is inconsistent. """ super().__init__() self.path = path self.file_id = file_id self.reason = reason class InconsistentDeltaDelta(InconsistentDelta): """Used when we get a delta that is not valid.""" _fmt = "An inconsistent delta was supplied: %(delta)r\nreason: %(reason)s" def __init__(self, delta, reason): """Initialize with delta and inconsistency reason. Args: delta: The inconsistent delta. reason: The reason why the delta is inconsistent. """ BzrFormatsError.__init__(self) self.delta = delta self.reason = reason class InternalBzrFormatsError(BzrFormatsError): """Base class for errors that indicate a bug in bzrformats.""" internal_error = True class BzrCheckError(InternalBzrFormatsError): """Internal check failed.""" _fmt = "Internal check failed: %(msg)s" def __init__(self, msg): """Initialize.""" super().__init__() self.msg = msg class LockError(BzrFormatsError): """Base class for lock-related errors.""" _fmt = "Lock error: %(msg)s" internal_error = False class ObjectNotLocked(LockError): """Object is not locked.""" _fmt = "%(obj)r is not locked" def __init__(self, obj): """Initialize.""" super().__init__() self.obj = obj class ReadOnlyError(LockError): """A write attempt was made in a read-only transaction.""" _fmt = "A write attempt was made in a read only transaction on %(obj)s" def __init__(self, obj): """Initialize.""" super().__init__() self.obj = obj class ReadOnlyObjectDirtiedError(ReadOnlyError): """Cannot change object in a read-only transaction.""" _fmt = "Cannot change 
object %(obj)r in read only transaction" class OutSideTransaction(BzrFormatsError): """Operation attempted after the transaction finished.""" _fmt = ( "A transaction related operation was attempted after the transaction finished." ) class LockContention(LockError): """Could not acquire lock.""" _fmt = 'Could not acquire lock "%(lock)s": %(msg)s' def __init__(self, lock, msg=""): """Initialize.""" super().__init__() self.lock = lock self.msg = msg class LockNotHeld(LockError): """Lock is not held.""" _fmt = "Lock not held: %(lock)s" def __init__(self, lock): """Initialize.""" super().__init__() self.lock = lock class InvalidNormalization(PathError): """Path is not unicode normalized.""" _fmt = 'Path "%(path)s" is not unicode normalized' class AlreadyVersionedError(BzrFormatsError): """Path is already versioned.""" _fmt = "%(context_info)s%(path)s is already versioned." def __init__(self, path, context_info=None): """Initialize.""" super().__init__() self.path = path if context_info is None: self.context_info = "" else: self.context_info = context_info + ". " class NotVersionedError(BzrFormatsError): """Path is not versioned.""" _fmt = "%(context_info)s%(path)s is not versioned." def __init__(self, path, context_info=""): """Initialize.""" super().__init__() self.path = path if context_info: self.context_info = context_info + ". 
" else: self.context_info = "" class NoSuchRevision(InternalBzrFormatsError): """Branch has no such revision.""" _fmt = "%(branch)s has no revision %(revision)s" def __init__(self, branch, revision): """Initialize.""" super().__init__() self.branch = branch self.revision = revision bzrformats_3.4.0.orig/bzrformats/generate_ids.py0000644000000000000000000000165015162073400017052 0ustar00# Copyright (C) 2006, 2007, 2009, 2010, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Common code for generating file or revision ids.""" from ._bzr_rs import ( # noqa: F401 _next_id_suffix, gen_file_id, gen_revision_id, gen_root_id, ) bzrformats_3.4.0.orig/bzrformats/groupcompress.py0000644000000000000000000027473615162115103017347 0ustar00# Copyright (C) 2008-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Core compression logic for compressing streams of related files.""" import logging import time import zlib from . import osutils from ._bzr_rs import groupcompress as _groupcompress_rs from .btree_index import BTreeBuilder from .errors import ( BzrFormatsError, InvalidRevisionId, ObjectNotLocked, ReadOnlyError, RevisionNotPresent, ) from .lru_cache import LRUSizeCache from .osutils import sha_strings from .versionedfile import ( AbsentContentFactory, ChunkedContentFactory, ExistingContent, UnavailableRepresentation, VersionedFilesWithFallbacks, _KeyRefs, adapter_registry, ) evil_logger = logging.getLogger("bzrformats.evil") logger = logging.getLogger("bzrformats.groupcompress") _null_sha1 = _groupcompress_rs.NULL_SHA1 PythonGroupCompressor = _groupcompress_rs.TraditionalGroupCompressor rabin_hash = _groupcompress_rs.rabin_hash # Minimum number of uncompressed bytes to try fetch at once when retrieving # groupcompress blocks. BATCH_SIZE = 2**16 def as_tuples(obj): """Ensure that the object and any referenced objects are plain tuples. :param obj: a list, tuple or StaticTuple :return: a plain tuple instance, with all children also being tuples. """ result = [] for item in obj: if isinstance(item, (tuple, list)): item = as_tuples(item) result.append(item) return tuple(result) def sort_gc_optimal(parent_map): """Sort and group the keys in parent_map into groupcompress order. groupcompress is defined (currently) as reverse-topological order, grouped by the key prefix. :return: A sorted-list of keys """ import vcsgraph.tsort as tsort # groupcompress ordering is approximately reverse topological, # properly grouped by file-id. 
per_prefix_map = {} for key, value in parent_map.items(): prefix = b"" if isinstance(key, bytes) or len(key) == 1 else key[0] try: per_prefix_map[prefix][key] = value except KeyError: per_prefix_map[prefix] = {key: value} present_keys = [] for prefix in sorted(per_prefix_map): present_keys.extend(reversed(tsort.topo_sort(per_prefix_map[prefix]))) return present_keys class DecompressCorruption(BzrFormatsError): """Exception raised when repository file decompression fails.""" _fmt = "Corruption while decompressing repository file%(orig_error)s" def __init__(self, orig_error=None): """Initialize DecompressCorruption. Args: orig_error: The original error that caused the corruption. """ if orig_error is not None: self.orig_error = f", {orig_error}" else: self.orig_error = "" super().__init__() # The max zlib window size is 32kB, so if we set 'max_size' output of the # decompressor to the requested bytes + 32kB, then we should guarantee # num_bytes coming out. _ZLIB_DECOMP_WINDOW = 32 * 1024 class GroupCompressBlock: """An object which maintains the internal structure of the compressed data. This tracks the meta info (start of text, length, type, etc.) """ # Group Compress Block v1 Zlib GCB_HEADER = b"gcb1z\n" # Group Compress Block v1 Lzma GCB_LZ_HEADER = b"gcb1l\n" GCB_KNOWN_HEADERS = (GCB_HEADER, GCB_LZ_HEADER) def __init__(self): """Initialize a GroupCompressBlock.""" # map by key? or just order in file? self._compressor_name = None self._z_content_chunks = None self._z_content_decompressor = None self._z_content_length = None self._content_length = None self._content = None self._content_chunks = None def __len__(self): """Return the maximum number of bytes this block will reference.""" # This is the maximum number of bytes this object will reference if # everything is decompressed. However, if we decompress less than # everything... 
        # (this would cause some problems for LRUSizeCache)
        return self._content_length + self._z_content_length

    def _ensure_content(self, num_bytes=None):
        """Make sure that content has been expanded enough.

        :param num_bytes: Ensure that we have extracted at least num_bytes of
            content. If None, consume everything
        """
        if self._content_length is None:
            raise AssertionError("self._content_length should never be None")
        if num_bytes is None:
            num_bytes = self._content_length
        elif self._content_length is not None and num_bytes > self._content_length:
            raise AssertionError(
                "requested num_bytes (%d) > content length (%d)"
                % (num_bytes, self._content_length)
            )
        # Expand the content if required
        if self._content is None and self._content_chunks is not None:
            self._content = b"".join(self._content_chunks)
            self._content_chunks = None
        if self._content is None:
            # We join self._z_content_chunks here, because if we are
            # decompressing, then it is *very* likely that we have a single
            # chunk
            if self._z_content_chunks is None:
                raise AssertionError("No content to decompress")
            z_content = b"".join(self._z_content_chunks)
            if z_content == b"":
                self._content = b""
            elif self._compressor_name == "lzma":
                # We don't do partial lzma decomp yet
                import pylzma

                self._content = pylzma.decompress(z_content)
            elif self._compressor_name == "zlib":
                # Start a zlib decompressor
                if num_bytes * 4 > self._content_length * 3:
                    # If we are requesting more than 3/4ths of the content,
                    # just extract the whole thing in a single pass
                    num_bytes = self._content_length
                    self._content = zlib.decompress(z_content)
                else:
                    self._z_content_decompressor = zlib.decompressobj()
                    # Seed the decompressor with the uncompressed bytes, so
                    # that the rest of the code is simplified
                    self._content = self._z_content_decompressor.decompress(
                        z_content, num_bytes + _ZLIB_DECOMP_WINDOW
                    )
                    if not self._z_content_decompressor.unconsumed_tail:
                        self._z_content_decompressor = None
            else:
                raise AssertionError(f"Unknown compressor: {self._compressor_name!r}")
        # Any
        # bytes remaining to be decompressed will be in the decompressor's
        # 'unconsumed_tail'

        # Do we have enough bytes already?
        if len(self._content) >= num_bytes:
            return
        # If we got this far, and don't have a decompressor, something is wrong
        if self._z_content_decompressor is None:
            raise AssertionError("No decompressor to decompress %d bytes" % num_bytes)
        remaining_decomp = self._z_content_decompressor.unconsumed_tail
        if not remaining_decomp:
            raise AssertionError("Nothing left to decompress")
        needed_bytes = num_bytes - len(self._content)
        # We always set max_size to 32kB over the minimum needed, so that
        # zlib will give us as much as we really want.
        # TODO: If this isn't good enough, we could make a loop here,
        #       that keeps expanding the request until we get enough
        self._content += self._z_content_decompressor.decompress(
            remaining_decomp, needed_bytes + _ZLIB_DECOMP_WINDOW
        )
        if len(self._content) < num_bytes:
            raise AssertionError(
                "%d bytes wanted, only %d available" % (num_bytes, len(self._content))
            )
        if not self._z_content_decompressor.unconsumed_tail:
            # The stream is finished
            self._z_content_decompressor = None

    def _parse_bytes(self, data, pos):
        """Read the various lengths from the header.

        This also populates the various 'compressed' buffers.

        :return: The position in bytes just after the last newline
        """
        # At present, we have 2 integers for the compressed and uncompressed
        # content. In base10 (ascii) 14 bytes can represent > 1TB, so to avoid
        # checking too far, cap the search to 14 bytes.
        pos2 = data.index(b"\n", pos, pos + 14)
        self._z_content_length = int(data[pos:pos2])
        pos = pos2 + 1
        pos2 = data.index(b"\n", pos, pos + 14)
        self._content_length = int(data[pos:pos2])
        pos = pos2 + 1
        if len(data) != (pos + self._z_content_length):
            # XXX: Define some GCCorrupt error ?
            raise AssertionError(
                "Invalid bytes: (%d) != %d + %d"
                % (len(data), pos, self._z_content_length)
            )
        self._z_content_chunks = (data[pos:],)

    @property
    def _z_content(self):
        """Return z_content_chunks as a simple string.

        Meant only to be used by the test suite.
        """
        if self._z_content_chunks is not None:
            return b"".join(self._z_content_chunks)
        return None

    @classmethod
    def from_bytes(cls, bytes):
        """Create a GroupCompressBlock from bytes.

        Args:
            bytes: The compressed block data.

        Returns:
            A new GroupCompressBlock instance.
        """
        out = cls()
        header = bytes[:6]
        if header not in cls.GCB_KNOWN_HEADERS:
            raise ValueError(
                f"bytes did not start with any of {cls.GCB_KNOWN_HEADERS!r}"
            )
        if header == cls.GCB_HEADER:
            out._compressor_name = "zlib"
        elif header == cls.GCB_LZ_HEADER:
            out._compressor_name = "lzma"
        else:
            raise ValueError(f"unknown compressor: {header!r}")
        out._parse_bytes(bytes, 6)
        return out

    def extract(self, key, start, end, sha1=None):
        """Extract the text for a specific key.

        :param key: The label used for this content
        :param sha1: TODO (should we validate only when sha1 is supplied?)
        :return: A list of byte chunks holding the content
        """
        if start == end == 0:
            return []
        self._ensure_content(end)
        # The bytes are 'f' or 'd' for the type, then a variable-length
        # base128 integer for the content size, then the actual content
        # We know that the variable-length integer won't be longer than 5
        # bytes (it takes 5 bytes to encode 2^32)
        c = self._content[start : start + 1]
        if c == b"f":
            pass
        else:
            if c != b"d":
                raise ValueError(f"Unknown content control code: {c}")
        content_len, len_len = decode_base128_int(self._content[start + 1 : start + 6])
        content_start = start + 1 + len_len
        if end != content_start + content_len:
            raise ValueError(
                "end != len according to field header"
                f" {end} != {content_start + content_len}"
            )
        if c == b"f":
            return [self._content[content_start:end]]
        # Must be type delta as checked above
        return [apply_delta_to_source(self._content, content_start, end)]

    def set_chunked_content(self, content_chunks, length):
        """Set the content of this block to the given chunks."""
        # If we have lots of short lines, it may be more efficient to join
        # the content ahead of time. If the content is <10MiB, we don't really
        # care about the extra memory consumption, so we can just pack it and
        # be done. However, timing showed 18s => 17.9s for repacking 1k revs of
        # mysql, which is below the noise margin
        self._content_length = length
        self._content_chunks = content_chunks
        self._content = None
        self._z_content_chunks = None

    def set_content(self, content):
        """Set the content of this block."""
        self._content_length = len(content)
        self._content = content
        self._z_content_chunks = None

    def _create_z_content_from_chunks(self, chunks):
        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
        # Peak memory at this point is 1 fulltext, 1 compressed text, + zlib
        # overhead (measured peak is maybe 30MB over the above...)
        compressed_chunks = list(map(compressor.compress, chunks))
        compressed_chunks.append(compressor.flush())
        # Ignore empty chunks
        self._z_content_chunks = [c for c in compressed_chunks if c]
        self._z_content_length = sum(map(len, self._z_content_chunks))

    def _create_z_content(self):
        if self._z_content_chunks is not None:
            return
        if self._content_chunks is not None:
            chunks = self._content_chunks
        else:
            chunks = (self._content,)
        self._create_z_content_from_chunks(chunks)

    def to_chunks(self):
        """Create the byte stream as a series of 'chunks'."""
        self._create_z_content()
        header = self.GCB_HEADER
        chunks = [
            b"%s%d\n%d\n" % (header, self._z_content_length, self._content_length),
        ]
        chunks.extend(self._z_content_chunks)
        total_len = sum(map(len, chunks))
        return total_len, chunks

    def to_bytes(self):
        """Encode the information into a byte stream."""
        _total_len, chunks = self.to_chunks()
        return b"".join(chunks)

    def _dump(self, include_text=False):
        """Take this block, and spit out a human-readable structure.

        :param include_text: Inserts also include text bits, choose whether you
            want this displayed in the dump or not.
        :return: A dump of the given block. The layout is something like:
            [('f', length), ('d', delta_length, text_length, [delta_info])]
            delta_info := [('i', num_bytes, text), ('c', offset, num_bytes),
            ...]
""" self._ensure_content() result = [] pos = 0 while pos < self._content_length: kind = self._content[pos : pos + 1] pos += 1 if kind not in (b"f", b"d"): raise ValueError(f"invalid kind character: {kind!r}") content_len, len_len = decode_base128_int(self._content[pos : pos + 5]) pos += len_len if content_len + pos > self._content_length: raise ValueError( "invalid content_len %d for record @ pos %d" % (content_len, pos - len_len - 1) ) if kind == b"f": # Fulltext if include_text: text = self._content[pos : pos + content_len] result.append((b"f", content_len, text)) else: result.append((b"f", content_len)) elif kind == b"d": # Delta delta_content = self._content[pos : pos + content_len] delta_info = [] # The first entry in a delta is the decompressed length decomp_len, delta_pos = decode_base128_int(delta_content) result.append((b"d", content_len, decomp_len, delta_info)) measured_len = 0 while delta_pos < content_len: c = delta_content[delta_pos] delta_pos += 1 if c & 0x80: # Copy (offset, length, delta_pos) = decode_copy_instruction( delta_content, c, delta_pos ) if include_text: text = self._content[offset : offset + length] delta_info.append((b"c", offset, length, text)) else: delta_info.append((b"c", offset, length)) measured_len += length else: # Insert if include_text: txt = delta_content[delta_pos : delta_pos + c] else: txt = b"" delta_info.append((b"i", c, txt)) measured_len += c delta_pos += c if delta_pos != content_len: raise ValueError( "Delta consumed a bad number of bytes:" " %d != %d" % (delta_pos, content_len) ) if measured_len != decomp_len: raise ValueError( "Delta claimed fulltext was %d bytes, but" " extraction resulted in %d bytes" % (decomp_len, measured_len) ) pos += content_len return result class _LazyGroupCompressFactory: """Yield content from a GroupCompressBlock on demand.""" def __init__(self, key, parents, manager, start, end, first): """Create a _LazyGroupCompressFactory. 
:param key: The key of just this record :param parents: The parents of this key (possibly None) :param manager: The _LazyGroupContentManager that owns this factory and holds its GroupCompressBlock :param start: Offset of the first byte for this record in the uncompressed content :param end: Offset of the byte just after the end of this record (ie, bytes = content[start:end]) :param first: Is this the first Factory for the given block? """ self.key = key self.parents = parents self.sha1 = None self.size = None # Note: This attribute coupled with Manager._factories creates a # reference cycle. Perhaps we would rather use a weakref(), or # find an appropriate time to release the ref. After the first # get_bytes_as call? After Manager.get_record_stream() returns # the object? self._manager = manager self._chunks = None self.storage_kind = "groupcompress-block" if not first: self.storage_kind = "groupcompress-block-ref" self._first = first self._start = start self._end = end def __repr__(self): return f"{self.__class__.__name__}({self.key}, first={self._first})" def _extract_bytes(self): # Grab and cache the raw bytes for this entry # and break the ref-cycle with _manager since we don't need it # anymore try: self._manager._prepare_for_extract() except zlib.error as value: raise DecompressCorruption("zlib: " + str(value)) from value block = self._manager._block self._chunks = block.extract(self.key, self._start, self._end) # There are code paths that first extract as fulltext, and then # extract as storage_kind (smart fetch). So we don't break the # refcycle here, but instead in manager.get_record_stream() def get_bytes_as(self, storage_kind): if storage_kind == self.storage_kind: if self._first: # wire bytes, something...
return self._manager._wire_bytes() else: return b"" if storage_kind in ("fulltext", "chunked", "lines"): if self._chunks is None: self._extract_bytes() if storage_kind == "fulltext": return b"".join(self._chunks) elif storage_kind == "chunked": return self._chunks else: return osutils.chunks_to_lines(self._chunks) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) def iter_bytes_as(self, storage_kind): if self._chunks is None: self._extract_bytes() if storage_kind == "chunked": return iter(self._chunks) elif storage_kind == "lines": return osutils.chunks_to_lines_iter(iter(self._chunks)) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) class _LazyGroupContentManager: """This manages a group of _LazyGroupCompressFactory objects.""" _max_cut_fraction = 0.75 # We allow a block to be trimmed to 75% of # current size, and still be considered # reusable _full_block_size = 4 * 1024 * 1024 _full_mixed_block_size = 2 * 1024 * 1024 _full_enough_block_size = 3 * 1024 * 1024 # size at which we won't repack _full_enough_mixed_block_size = 2 * 768 * 1024 # 1.5MB def __init__(self, block, get_compressor_settings=None): self._block = block # We need to preserve the ordering self._factories = [] self._last_byte = 0 self._get_settings = get_compressor_settings self._compressor_settings = None def _get_compressor_settings(self): if self._compressor_settings is not None: return self._compressor_settings settings = None if self._get_settings is not None: settings = self._get_settings() if settings is None: vf = GroupCompressVersionedFiles settings = vf._DEFAULT_COMPRESSOR_SETTINGS self._compressor_settings = settings return self._compressor_settings def add_factory(self, key, parents, start, end): first = bool(not self._factories) # Note that this creates a reference cycle....
factory = _LazyGroupCompressFactory(key, parents, self, start, end, first=first) # max() works here, but as a function call, doing a compare seems to be # significantly faster, timeit says 250ms for max() and 100ms for the # comparison if end > self._last_byte: self._last_byte = end self._factories.append(factory) def get_record_stream(self): """Get a record for all keys added so far.""" for factory in self._factories: yield factory # Break the ref-cycle factory._bytes = None factory._manager = None # TODO: Consider setting self._factories = None after the above loop, # as it will break the reference cycle def _trim_block(self, last_byte): """Create a new GroupCompressBlock, with just some of the content.""" # None of the factories need to be adjusted, because the content is # located in an identical place. Just that some of the unreferenced # trailing bytes are stripped logger.debug( "stripping trailing bytes from groupcompress block %d => %d", self._block._content_length, last_byte, ) new_block = GroupCompressBlock() self._block._ensure_content(last_byte) new_block.set_content(self._block._content[:last_byte]) self._block = new_block def _make_group_compressor(self): return GroupCompressor(self._get_compressor_settings()) def _rebuild_block(self): """Create a new GroupCompressBlock with only the referenced texts.""" compressor = self._make_group_compressor() tstart = time.time() old_length = self._block._content_length end_point = 0 for factory in self._factories: chunks = factory.get_bytes_as("chunked") chunks_len = factory.size if chunks_len is None: chunks_len = sum(map(len, chunks)) (found_sha1, start_point, end_point, _type) = compressor.compress( factory.key, chunks, chunks_len, factory.sha1 ) # Now update this factory with the new offsets, etc factory.sha1 = found_sha1 factory._start = start_point factory._end = end_point self._last_byte = end_point new_block = compressor.flush() # TODO: Should we check that new_block really *is* smaller than the old # 
block? It seems hard to come up with a method that it would # expand, since we do full compression again. Perhaps based on a # request that ends up poorly ordered? # TODO: If the content would have expanded, then we would want to # handle a case where we need to split the block. # Now that we have a user-tweakable option # (max_bytes_to_index), it is possible that one person set it # to a very low value, causing poor compression. delta = time.time() - tstart self._block = new_block logger.debug( "creating new compressed block on-the-fly in %.3fs %d bytes => %d bytes", delta, old_length, self._block._content_length, ) def _prepare_for_extract(self): """A _LazyGroupCompressFactory is about to extract to fulltext.""" # We expect that if one child is going to fulltext, all will be. This # helps prevent all of them from extracting a small amount at a time. # Which in itself isn't terribly expensive, but resizing 2MB 32kB at a # time (self._block._content) is a little expensive. self._block._ensure_content(self._last_byte) def _check_rebuild_action(self): """Check to see if our block should be repacked.""" total_bytes_used = 0 last_byte_used = 0 for factory in self._factories: total_bytes_used += factory._end - factory._start if last_byte_used < factory._end: last_byte_used = factory._end # If we are using more than half of the bytes from the block, we have # nothing else to check if total_bytes_used * 2 >= self._block._content_length: return None, last_byte_used, total_bytes_used # We are using less than 50% of the content. Is the content we are # using at the beginning of the block? If so, we can just trim the # tail, rather than rebuilding from scratch. if total_bytes_used * 2 > last_byte_used: return "trim", last_byte_used, total_bytes_used # We are using a small amount of the data, and it isn't just packed # nicely at the front, so rebuild the content. 
# Note: This would be *nicer* as a strip-data-from-group, rather than # building it up again from scratch # It might be reasonable to consider the fulltext sizes for # different bits when deciding this, too. As you may have a small # fulltext, and a trivial delta, and you are just trading around # for another fulltext. If we do a simple 'prune' you may end up # expanding many deltas into fulltexts, as well. # If we build a cheap enough 'strip', then we could try a strip, # if that expands the content, we then rebuild. return "rebuild", last_byte_used, total_bytes_used def check_is_well_utilized(self): """Is the current block considered 'well utilized'? This heuristic asks if the current block considers itself to be a fully developed group, rather than just a loose collection of data. """ if len(self._factories) == 1: # A block of length 1 could be improved by combining with other # groups - don't look deeper. Even larger than max size groups # could compress well with adjacent versions of the same thing. return False _action, _last_byte_used, total_bytes_used = self._check_rebuild_action() block_size = self._block._content_length if total_bytes_used < block_size * self._max_cut_fraction: # This block wants to trim itself small enough that we want to # consider it under-utilized. return False # TODO: This code is meant to be the twin of _insert_record_stream's # 'start_new_block' logic. It would probably be better to factor # out that logic into a shared location, so that it stays # together better # We currently assume a block is properly utilized whenever it is >75% # of the size of a 'full' block. In normal operation, a block is # considered full when it hits 4MB of same-file content. So any block # >3MB is 'full enough'. # The only time this isn't true is when a given block has large-object # content. (a single file >4MB, etc.) # Under these circumstances, we allow a block to grow to # 2 x largest_content. 
Which means that if a given block had a large # object, it may actually be under-utilized. However, given that this # is 'pack-on-the-fly' it is probably reasonable to not repack large # content blobs on-the-fly. Note that because we return False for all # 1-item blobs, we will repack them; we may wish to reevaluate our # treatment of large object blobs in the future. if block_size >= self._full_enough_block_size: return True # If a block is <3MB, it still may be considered 'full' if it contains # mixed content. The current rule is 2MB of mixed content is considered # full. So check to see if this block contains mixed content, and # set the threshold appropriately. common_prefix = None for factory in self._factories: prefix = factory.key[:-1] if common_prefix is None: common_prefix = prefix elif prefix != common_prefix: # Mixed content, check the size appropriately if block_size >= self._full_enough_mixed_block_size: return True break # The content failed both the mixed check and the single-content check # so obviously it is not fully utilized # TODO: there is one other constraint that isn't being checked # namely, that the entries in the block are in the appropriate # order. For example, you could insert the entries in exactly # reverse groupcompress order, and we would think that is ok. # (all the right objects are in one group, and it is fully # utilized, etc.) For now, we assume that case is rare, # especially since we should always fetch in 'groupcompress' # order. return False def _check_rebuild_block(self): action, last_byte_used, _total_bytes_used = self._check_rebuild_action() if action is None: return if action == "trim": self._trim_block(last_byte_used) elif action == "rebuild": self._rebuild_block() else: raise ValueError(f"unknown rebuild action: {action!r}") def _wire_bytes(self): """Return a byte stream suitable for transmitting over the wire.""" self._check_rebuild_block() # The outer block starts with: # 'groupcompress-block\n' # (length of the compressed header)\n # (length of the uncompressed header)\n # (length of the gc block)\n # (compressed header bytes)
# (gc block bytes) lines = [b"groupcompress-block\n"] # The minimal info we need is the key, the start offset, and the # parents. The length and type are encoded in the record itself. # However, passing in the other bits makes it easier. The list of # keys, and the start offset, the length # 1 line key # 1 line with parents, '' for () # 1 line for start offset # 1 line for end byte header_lines = [] for factory in self._factories: key_bytes = b"\x00".join(factory.key) parents = factory.parents if parents is None: parent_bytes = b"None:" else: parent_bytes = b"\t".join(b"\x00".join(key) for key in parents) record_header = b"%s\n%s\n%d\n%d\n" % ( key_bytes, parent_bytes, factory._start, factory._end, ) header_lines.append(record_header) # TODO: Can we break the refcycle at this point and set # factory._manager = None? header_bytes = b"".join(header_lines) del header_lines header_bytes_len = len(header_bytes) z_header_bytes = zlib.compress(header_bytes) del header_bytes z_header_bytes_len = len(z_header_bytes) block_bytes_len, block_chunks = self._block.to_chunks() lines.append( b"%d\n%d\n%d\n" % (z_header_bytes_len, header_bytes_len, block_bytes_len) ) lines.append(z_header_bytes) lines.extend(block_chunks) del z_header_bytes, block_chunks # TODO: This is a point where we will double the memory consumption. To # avoid this, we probably have to switch to a 'chunked' api return b"".join(lines) @classmethod def from_bytes(cls, bytes): # TODO: This does extra string copying, probably better to do it a # different way.
At a minimum this creates 2 copies of the # compressed content (storage_kind, z_header_len, header_len, block_len, rest) = bytes.split( b"\n", 4 ) del bytes if storage_kind != b"groupcompress-block": raise ValueError(f"Unknown storage kind: {storage_kind}") z_header_len = int(z_header_len) if len(rest) < z_header_len: raise ValueError("Compressed header len shorter than all bytes") z_header = rest[:z_header_len] header_len = int(header_len) header = zlib.decompress(z_header) if len(header) != header_len: raise ValueError("invalid length for decompressed bytes") del z_header block_len = int(block_len) if len(rest) != z_header_len + block_len: raise ValueError("Invalid length for block") block_bytes = rest[z_header_len:] del rest # So now we have a valid GCB, we just need to parse the factories that # were sent to us header_lines = header.split(b"\n") del header last = header_lines.pop() if last != b"": raise ValueError("header lines did not end with a trailing newline") if len(header_lines) % 4 != 0: raise ValueError("The header was not an even multiple of 4 lines") block = GroupCompressBlock.from_bytes(block_bytes) del block_bytes result = cls(block) for start in range(0, len(header_lines), 4): # intern()? key = tuple(header_lines[start].split(b"\x00")) parents_line = header_lines[start + 1] if parents_line == b"None:": parents = None else: parents = tuple( [ tuple(segment.split(b"\x00")) for segment in parents_line.split(b"\t") if segment ] ) start_offset = int(header_lines[start + 2]) end_offset = int(header_lines[start + 3]) result.add_factory(key, parents, start_offset, end_offset) return result def network_block_to_records(storage_kind, bytes, line_end): """Convert a network block to records. Args: storage_kind: The type of storage (must be 'groupcompress-block'). bytes: The block data bytes. line_end: Line ending marker. Returns: Generator yielding the record factories (_LazyGroupCompressFactory objects) parsed from the block.
""" if storage_kind != "groupcompress-block": raise ValueError(f"Unknown storage kind: {storage_kind}") manager = _LazyGroupContentManager.from_bytes(bytes) return manager.get_record_stream() class PyrexGroupCompressor: """Produce a serialised group of compressed texts. It contains code very similar to SequenceMatcher because of having a similar task. However some key differences apply: * there is no junk, we want a minimal edit not a human readable diff. * we don't filter very common lines (because we don't know where a good range will start, and after the first text we want to be emitting minimal edits only). * we chain the left side, not the right side * we incrementally update the adjacency matrix as new lines are provided. * we look for matches in all of the left side, so the routine which does the analogous task of find_longest_match does not need to filter on the left side. """ chunks: list[bytes] def __init__(self, settings=None): """Create a GroupCompressor.""" self._last = None self.endpoint = 0 self.input_bytes = 0 self.labels_deltas = {} self._delta_index = None # Set by the children self._block = GroupCompressBlock() if settings is None: self._settings = {} else: self._settings = settings self.chunks = [] max_bytes_to_index = self._settings.get("max_bytes_to_index", 0) self._delta_index = DeltaIndex(max_bytes_to_index=max_bytes_to_index) def compress(self, key, chunks, length, expected_sha, nostore_sha=None, soft=False): """Compress lines with label key. :param key: A key tuple. It is stored in the output for identification of the text during decompression. If the last element is None it is replaced with the sha1 of the text - e.g. sha1:xxxxxxx. :param chunks: Chunks of bytes to be compressed :param length: Length of chunks :param expected_sha: If non-None, the sha the lines are believed to have. During compression the sha is calculated; a mismatch will cause an error.
:param nostore_sha: If the computed sha1 sum matches, we will raise ExistingContent rather than adding the text. :param soft: Do a 'soft' compression. This means that we require larger ranges to match to be considered for a copy command. :return: The sha1 of lines, the start and end offsets in the delta, and the type ('fulltext' or 'delta'). :seealso VersionedFiles.add_lines: """ if length == 0: # empty, like a dir entry, etc if nostore_sha == _null_sha1: raise ExistingContent() return _null_sha1, 0, 0, "fulltext" # we assume someone knew what they were doing when they passed it in sha1 = expected_sha if expected_sha is not None else sha_strings(chunks) if nostore_sha is not None and sha1 == nostore_sha: raise ExistingContent() if key[-1] is None: key = key[:-1] + (b"sha1:" + sha1,) start, end, type = self._compress(key, chunks, length, length / 2, soft) return sha1, start, end, type def _compress(self, key, chunks, input_len, max_delta_size, soft=False): """See _CommonGroupCompressor._compress.""" # By having action/label/sha1/len, we can parse the group if the index # was ever destroyed, we have the key in 'label', we know the final # bytes are valid from sha1, and we know where to find the end of this # record because of 'len'. 
(the delta record itself will store the # total length for the expanded record) # 'len: %d\n' costs approximately 1% increase in total data # Having the labels at all costs us 9-10% increase, 38% increase for # inventory pages, and 5.8% increase for text pages # new_chunks = ['label:%s\nsha1:%s\n' % (label, sha1)] if self._delta_index._source_offset != self.endpoint: raise AssertionError( "_source_offset != endpoint" " somehow the DeltaIndex got out of sync with" " the output lines" ) bytes = b"".join(chunks) delta = self._delta_index.make_delta(bytes, max_delta_size) if delta is None: type = "fulltext" enc_length = encode_base128_int(input_len) len_mini_header = 1 + len(enc_length) self._delta_index.add_source(bytes, len_mini_header) new_chunks = [b"f", enc_length] + chunks else: type = "delta" enc_length = encode_base128_int(len(delta)) len_mini_header = 1 + len(enc_length) new_chunks = [b"d", enc_length, delta] self._delta_index.add_delta_source(delta, len_mini_header) # Before insertion start = self.endpoint chunk_start = len(self.chunks) # Now output these bytes self._output_chunks(new_chunks) self.input_bytes += input_len chunk_end = len(self.chunks) self.labels_deltas[key] = (start, chunk_start, self.endpoint, chunk_end) if self._delta_index._source_offset != self.endpoint: raise AssertionError( "the delta index is out of sync " f"with the output lines {self._delta_index._source_offset} != {self.endpoint}" ) return start, self.endpoint, type def _output_chunks(self, new_chunks): """Output some chunks. :param new_chunks: The chunks to output. """ self._last = (len(self.chunks), self.endpoint) endpoint = self.endpoint self.chunks.extend(new_chunks) endpoint += sum(map(len, new_chunks)) self.endpoint = endpoint def extract(self, key): """Extract a key previously added to the compressor. :param key: The key to extract. :return: An iterable over chunks and the sha1.
""" (_start_byte, start_chunk, _end_byte, end_chunk) = self.labels_deltas[key] delta_chunks = self.chunks[start_chunk:end_chunk] stored_bytes = b"".join(delta_chunks) kind = stored_bytes[:1] if kind == b"f": fulltext_len, offset = decode_base128_int(stored_bytes[1:10]) data_len = fulltext_len + 1 + offset if data_len != len(stored_bytes): raise ValueError( "Index claimed fulltext len, but stored bytes" f" claim {len(stored_bytes)} != {data_len}" ) data = [stored_bytes[offset + 1 :]] else: if kind != b"d": raise ValueError(f"Unknown content kind, bytes claim {kind}") # XXX: This is inefficient at best source = b"".join(self.chunks[:start_chunk]) delta_len, offset = decode_base128_int(stored_bytes[1:10]) data_len = delta_len + 1 + offset if data_len != len(stored_bytes): raise ValueError( "Index claimed delta len, but stored bytes" f" claim {len(stored_bytes)} != {data_len}" ) data = [apply_delta(source, stored_bytes[offset + 1 :])] data_sha1 = sha_strings(data) return data, data_sha1 def flush(self): """Finish this group, creating a formatted stream. After calling this, the compressor should no longer be used """ self._block.set_chunked_content(self.chunks, self.endpoint) self._delta_index = None self.chunks = None return self._block def flush_without_last(self): """Flush the buffer after removing the last item. Returns: The flushed group compress block. """ self._pop_last() return self.flush() def _pop_last(self): """Call this if you want to 'revoke' the last compression. After this, the data structures will be rolled back, but you cannot do more compression. """ self._delta_index = None del self.chunks[self._last[0] :] self.endpoint = self._last[1] self._last = None def ratio(self): """Return the overall compression ratio.""" return float(self.input_bytes) / float(self.endpoint) def make_pack_factory(graph, delta, keylength, inconsistency_fatal=True): """Create a factory for creating a pack based groupcompress. 
This is only functional enough to run interface tests, it doesn't try to provide a full pack environment. :param graph: Store a graph. :param delta: Delta compress contents. :param keylength: How long should keys be. """ from .pack import ContainerWriter from .pack_repo import _DirectPackAccess def factory(transport): parents = graph ref_length = 0 if graph: ref_length = 1 graph_index = BTreeBuilder(reference_lists=ref_length, key_elements=keylength) stream = transport.open_write_stream("newpack") writer = ContainerWriter(stream.write) writer.begin() index = _GCGraphIndex( graph_index, lambda: True, parents=parents, add_callback=graph_index.add_nodes, inconsistency_fatal=inconsistency_fatal, ) access = _DirectPackAccess({}) access.set_writer(writer, graph_index, (transport, "newpack")) result = GroupCompressVersionedFiles(index, access, delta) result.stream = stream result.writer = writer return result return factory def cleanup_pack_group(versioned_files): """Clean up after packing a group of versioned files. Args: versioned_files: The versioned files to clean up. """ versioned_files.writer.end() versioned_files.stream.close() class _BatchingBlockFetcher: """Fetch group compress blocks in batches. :ivar total_bytes: int of expected number of bytes needed to fetch the currently pending batch. """ def __init__(self, gcvf, locations, get_compressor_settings=None): self.gcvf = gcvf self.locations = locations self.keys = [] self.batch_memos = {} self.memos_to_get = [] self.total_bytes = 0 self.last_read_memo = None self.manager = None self._get_compressor_settings = get_compressor_settings def add_key(self, key): """Add another key to fetch. :return: The estimated number of bytes needed to fetch the batch so far.
""" self.keys.append(key) index_memo, _, _, _ = self.locations[key] read_memo = index_memo[0:3] # Three possibilities for this read_memo: # - it's already part of this batch; or # - it's not yet part of this batch, but is already cached; or # - it's not yet part of this batch and will need to be fetched. if read_memo in self.batch_memos: # This read memo is already in this batch. return self.total_bytes try: cached_block = self.gcvf._group_cache[read_memo] except KeyError: # This read memo is new to this batch, and the data isn't cached # either. self.batch_memos[read_memo] = None self.memos_to_get.append(read_memo) byte_length = read_memo[2] self.total_bytes += byte_length else: # This read memo is new to this batch, but cached. # Keep a reference to the cached block in batch_memos because it's # certain that we'll use it when this batch is processed, but # there's a risk that it would fall out of _group_cache between now # and then. self.batch_memos[read_memo] = cached_block return self.total_bytes def _flush_manager(self): if self.manager is not None: yield from self.manager.get_record_stream() self.manager = None self.last_read_memo = None def yield_factories(self, full_flush=False): """Yield factories for keys added since the last yield. They will be returned in the order they were added via add_key. :param full_flush: by default, some results may not be returned in case they can be part of the next batch. If full_flush is True, then all results are returned. """ if self.manager is None and not self.keys: return # Fetch all memos in this batch. blocks = self.gcvf._get_blocks(self.memos_to_get) # Turn blocks into factories and yield them. memos_to_get_stack = list(self.memos_to_get) memos_to_get_stack.reverse() for key in self.keys: index_memo, _, parents, _ = self.locations[key] read_memo = index_memo[:3] if self.last_read_memo != read_memo: # We are starting a new block. 
If we have a # manager, we have found everything that fits for # now, so yield records yield from self._flush_manager() # Now start a new manager. if memos_to_get_stack and memos_to_get_stack[-1] == read_memo: # The next block from _get_blocks will be the block we # need. block_read_memo, block = next(blocks) if block_read_memo != read_memo: raise AssertionError( "block_read_memo out of sync with read_memo " f"({block_read_memo!r} != {read_memo!r})" ) self.batch_memos[read_memo] = block memos_to_get_stack.pop() else: block = self.batch_memos[read_memo] self.manager = _LazyGroupContentManager( block, get_compressor_settings=self._get_compressor_settings ) self.last_read_memo = read_memo start, end = index_memo[3:5] self.manager.add_factory(key, parents, start, end) if full_flush: yield from self._flush_manager() del self.keys[:] self.batch_memos.clear() del self.memos_to_get[:] self.total_bytes = 0 class GroupCompressVersionedFiles(VersionedFilesWithFallbacks): """A group-compress based VersionedFiles implementation.""" # This controls how the GroupCompress DeltaIndex works. Basically, we # compute hash pointers into the source blocks (so hash(text) => text). # However each of these references costs some memory in trade against a # more accurate match result. For very large files, they either are # pre-compressed and change in bulk whenever they change, or change in just # local blocks. Either way, 'improved resolution' is not very helpful, # versus running out of memory trying to track everything. The default max # gives 100% sampling of a 1MB file. _DEFAULT_MAX_BYTES_TO_INDEX = 1024 * 1024 _DEFAULT_COMPRESSOR_SETTINGS = {"max_bytes_to_index": _DEFAULT_MAX_BYTES_TO_INDEX} def __init__( self, index, access, delta=True, _unadded_refs=None, _group_cache=None ): """Create a GroupCompressVersionedFiles object. :param index: The index object storing access and graph data. :param access: The access object storing raw data.
:param delta: Whether to delta compress or just entropy compress. :param _unadded_refs: private parameter, don't use. :param _group_cache: private parameter, don't use. """ self._index = index self._access = access self._delta = delta if _unadded_refs is None: _unadded_refs = {} self._unadded_refs = _unadded_refs if _group_cache is None: _group_cache = LRUSizeCache(max_size=50 * 1024 * 1024) self._group_cache = _group_cache self._immediate_fallback_vfs = [] self._max_bytes_to_index = None def without_fallbacks(self): """Return a clone of this object without any fallbacks configured.""" return GroupCompressVersionedFiles( self._index, self._access, self._delta, _unadded_refs=dict(self._unadded_refs), _group_cache=self._group_cache, ) def add_lines( self, key, parents, lines, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, check_content=True, ): r"""Add a text to the store. :param key: The key tuple of the text to add. :param parents: The parents key tuples of the text to add. :param lines: A list of lines. Each line must be a bytestring. And all of them except the last must be terminated with \n and contain no other \n's. The last line may either contain no \n's or a single terminating \n. If the lines list does not meet this constraint the add routine may error or may succeed - but you will be unable to read the data back accurately. (Checking the lines have been split correctly is expensive and extremely unlikely to catch bugs so it is not done at runtime unless check_content is True.) :param parent_texts: An optional dictionary containing the opaque representations of some or all of the parents of version_id to allow delta optimisations. VERY IMPORTANT: the texts must be those returned by add_lines or data corruption can be caused. :param left_matching_blocks: a hint about which areas are common between the text and its left-hand-parent. The format is the SequenceMatcher.get_matching_blocks format.
:param nostore_sha: Raise ExistingContent and do not add the lines to the versioned file if the digest of the lines matches this. :param random_id: If True a random id has been selected rather than an id determined by some deterministic process such as a converter from a foreign VCS. When True the backend may choose not to check for uniqueness of the resulting key within the versioned file, so this should only be done when the result is expected to be unique anyway. :param check_content: If True, the lines supplied are verified to be bytestrings that are correctly formed lines. :return: The text sha1, the number of bytes in the text, and an opaque representation of the inserted version which can be provided back to future add_lines calls in the parent_texts dictionary. """ self._index._check_write_ok() if check_content: self._check_lines_not_unicode(lines) self._check_lines_are_lines(lines) return self.add_content( ChunkedContentFactory(key, parents, sha_strings(lines), lines), parent_texts, left_matching_blocks, nostore_sha, random_id, ) def add_content( self, factory, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, ): """Add a text to the store. :param factory: A ContentFactory that can be used to retrieve the key, parents and contents. :param parent_texts: An optional dictionary containing the opaque representations of some or all of the parents of version_id to allow delta optimisations. VERY IMPORTANT: the texts must be those returned by add_lines or data corruption can be caused. :param left_matching_blocks: a hint about which areas are common between the text and its left-hand-parent. The format is the SequenceMatcher.get_matching_blocks format. :param nostore_sha: Raise ExistingContent and do not add the lines to the versioned file if the digest of the lines matches this. :param random_id: If True a random id has been selected rather than an id determined by some deterministic process such as a converter from a foreign VCS. 
When True the backend may choose not to check for uniqueness of the resulting key within the versioned file, so this should only be done when the result is expected to be unique anyway. :return: The text sha1, the number of bytes in the text, and an opaque representation of the inserted version which can be provided back to future add_lines calls in the parent_texts dictionary. """ self._index._check_write_ok() parents = factory.parents self._check_add(factory.key, random_id) if parents is None: # The caller might pass None if there is no graph data, but kndx # indexes can't directly store that, so we give them # an empty tuple instead. parents = () # double handling for now. Make it work until then. sha1, length = list( self._insert_record_stream( [factory], random_id=random_id, nostore_sha=nostore_sha ) )[0] return sha1, length, None def add_fallback_versioned_files(self, a_versioned_files): """Add a source of texts for texts not present in this knit. :param a_versioned_files: A VersionedFiles object. """ self._immediate_fallback_vfs.append(a_versioned_files) def annotate(self, key): """See VersionedFiles.annotate.""" ann = self.get_annotator() return ann.annotate_flat(key) def get_annotator(self): """Get an annotator for this versioned file. Returns: A VersionedFileAnnotator instance. 
""" from .annotate import VersionedFileAnnotator return VersionedFileAnnotator(self) def check(self, progress_bar=None, keys=None): """See VersionedFiles.check().""" if keys is None: keys = self.keys() for record in self.get_record_stream(keys, "unordered", True): for _chunk in record.iter_bytes_as("chunked"): pass else: return self.get_record_stream(keys, "unordered", True) def clear_cache(self): """See VersionedFiles.clear_cache().""" self._group_cache.clear() self._index._graph_index.clear_cache() self._index._int_cache.clear() def _check_add(self, key, random_id): """Check that version_id and lines are safe to add.""" version_id = key[-1] if version_id is not None and osutils.contains_whitespace(version_id): raise InvalidRevisionId(version_id, self) self.check_not_reserved_id(version_id) # TODO: If random_id==False and the key is already present, we should # probably check that the existing content is identical to what is # being inserted, and otherwise raise an exception. This would make # the bundle code simpler. def get_parent_map(self, keys): """Get a map of the graph parents of keys. :param keys: The keys to look up parents for. :return: A mapping from keys to parents. Absent keys are absent from the mapping. """ return self._get_parent_map_with_sources(keys)[0] def _get_parent_map_with_sources(self, keys): """Get a map of the parents of keys. :param keys: The keys to look up parents for. :return: A tuple. The first element is a mapping from keys to parents. Absent keys are absent from the mapping. The second element is a list with the locations each key was found in. The first element is the in-this-knit parents, the second the first fallback source, and so on. 
""" result = {} sources = [self._index] + self._immediate_fallback_vfs source_results = [] missing = set(keys) for source in sources: if not missing: break new_result = source.get_parent_map(missing) source_results.append(new_result) result.update(new_result) missing.difference_update(set(new_result)) return result, source_results def _get_blocks(self, read_memos): """Get GroupCompressBlocks for the given read_memos. :returns: a series of (read_memo, block) pairs, in the order they were originally passed. """ cached = {} for read_memo in read_memos: try: block = self._group_cache[read_memo] except KeyError: pass else: cached[read_memo] = block not_cached = [] not_cached_seen = set() for read_memo in read_memos: if read_memo in cached: # Don't fetch what we already have continue if read_memo in not_cached_seen: # Don't try to fetch the same data twice continue not_cached.append(read_memo) not_cached_seen.add(read_memo) raw_records = self._access.get_raw_records(not_cached) for read_memo in read_memos: try: yield read_memo, cached[read_memo] except KeyError: # Read the block, and cache it. zdata = next(raw_records) block = GroupCompressBlock.from_bytes(zdata) self._group_cache[read_memo] = block cached[read_memo] = block yield read_memo, block def get_missing_compression_parent_keys(self): """Return the keys of missing compression parents. Missing compression parents occur when a record stream was missing basis texts, or a index was scanned that had missing basis texts. """ # GroupCompress cannot currently reference texts that are not in the # group, so this is valid for now return frozenset() def get_record_stream(self, keys, ordering, include_delta_closure): """Get a stream of records for keys. :param keys: The keys to include. :param ordering: Either 'unordered' or 'topological'. A topologically sorted stream has compression parents strictly before their children. 
:param include_delta_closure: If True then the closure across any compression parents will be included (in the opaque data). :return: An iterator of ContentFactory objects, each of which is only valid until the iterator is advanced. """ from .pack_repo import RetryWithNewPacks # keys might be a generator orig_keys = list(keys) keys = set(keys) if not keys: return if not self._index.has_graph and ordering in ("topological", "groupcompress"): # Cannot topological order when no graph has been stored. # but we allow 'as-requested' or 'unordered' ordering = "unordered" remaining_keys = keys while True: try: keys = set(remaining_keys) for content_factory in self._get_remaining_record_stream( keys, orig_keys, ordering, include_delta_closure ): remaining_keys.discard(content_factory.key) yield content_factory return except RetryWithNewPacks as e: self._access.reload_or_raise(e) def _find_from_fallback(self, missing): """Find whatever keys you can from the fallbacks. :param missing: A set of missing keys. This set will be mutated as keys are found from a fallback_vfs :return: (parent_map, key_to_source_map, source_results) parent_map the overall key => parent_keys key_to_source_map a dict from {key: source} source_results a list of (source: keys) """ parent_map = {} key_to_source_map = {} source_results = [] for source in self._immediate_fallback_vfs: if not missing: break source_parents = source.get_parent_map(missing) parent_map.update(source_parents) source_parents = list(source_parents) source_results.append((source, source_parents)) key_to_source_map.update((key, source) for key in source_parents) missing.difference_update(source_parents) return parent_map, key_to_source_map, source_results def _get_ordered_source_keys(self, ordering, parent_map, key_to_source_map): """Get the (source, [keys]) list. The returned objects should be in the order defined by 'ordering', which can weave between different sources. 
:param ordering: Must be one of 'topological' or 'groupcompress' :return: List of [(source, [keys])] tuples, such that all keys are in the defined order, regardless of source. """ import vcsgraph.tsort as tsort if ordering == "topological": present_keys = tsort.topo_sort(parent_map) else: # ordering == 'groupcompress' # XXX: This only optimizes for the target ordering. We may need # to balance that with the time it takes to extract # ordering, by somehow grouping based on # locations[key][0:3] present_keys = sort_gc_optimal(parent_map) # Now group by source: source_keys = [] current_source = None for key in present_keys: source = key_to_source_map.get(key, self) if source is not current_source: source_keys.append((source, [])) current_source = source source_keys[-1][1].append(key) return source_keys def _get_as_requested_source_keys( self, orig_keys, locations, unadded_keys, key_to_source_map ): source_keys = [] current_source = None for key in orig_keys: if key in locations or key in unadded_keys: source = self elif key in key_to_source_map: source = key_to_source_map[key] else: # absent continue if source is not current_source: source_keys.append((source, [])) current_source = source source_keys[-1][1].append(key) return source_keys def _get_io_ordered_source_keys(self, locations, unadded_keys, source_result): def get_group(key): # This is the group the bytes are stored in, followed by the # location in the group return locations[key][0] # We don't have an ordering for keys in the in-memory object, but # lets process the in-memory ones first. present_keys = list(unadded_keys) present_keys.extend(sorted(locations, key=get_group)) # Now grab all of the ones from other sources source_keys = [(self, present_keys)] source_keys.extend(source_result) return source_keys def _get_remaining_record_stream( self, keys, orig_keys, ordering, include_delta_closure ): """Get a stream of records for keys. :param keys: The keys to include. 
:param ordering: one of 'unordered', 'topological', 'groupcompress' or 'as-requested' :param include_delta_closure: If True then the closure across any compression parents will be included (in the opaque data). :return: An iterator of ContentFactory objects, each of which is only valid until the iterator is advanced. """ # Cheap: iterate locations = self._index.get_build_details(keys) unadded_keys = set(self._unadded_refs).intersection(keys) missing = keys.difference(locations) missing.difference_update(unadded_keys) ( fallback_parent_map, key_to_source_map, source_result, ) = self._find_from_fallback(missing) if ordering in ("topological", "groupcompress"): # would be better to not globally sort initially but instead # start with one key, recurse to its oldest parent, then grab # everything in the same group, etc. parent_map = {key: details[2] for key, details in locations.items()} for key in unadded_keys: parent_map[key] = self._unadded_refs[key] parent_map.update(fallback_parent_map) source_keys = self._get_ordered_source_keys( ordering, parent_map, key_to_source_map ) elif ordering == "as-requested": source_keys = self._get_as_requested_source_keys( orig_keys, locations, unadded_keys, key_to_source_map ) else: # We want to yield the keys in a semi-optimal (read-wise) ordering. # Otherwise we thrash the _group_cache and destroy performance source_keys = self._get_io_ordered_source_keys( locations, unadded_keys, source_result ) for key in missing: yield AbsentContentFactory(key) # Batch up as many keys as we can until either: # - we encounter an unadded ref, or # - we run out of keys, or # - the total bytes to retrieve for this batch > BATCH_SIZE batcher = _BatchingBlockFetcher( self, locations, get_compressor_settings=self._get_compressor_settings ) for source, keys in source_keys: if source is self: for key in keys: if key in self._unadded_refs: # Flush batch, then yield unadded ref from # self._compressor. 
yield from batcher.yield_factories(full_flush=True) chunks, sha1 = self._compressor.extract(key) parents = self._unadded_refs[key] yield ChunkedContentFactory(key, parents, sha1, chunks) continue if batcher.add_key(key) > BATCH_SIZE: # Ok, this batch is big enough. Yield some results. yield from batcher.yield_factories() else: yield from batcher.yield_factories(full_flush=True) yield from source.get_record_stream( keys, ordering, include_delta_closure ) yield from batcher.yield_factories(full_flush=True) def get_sha1s(self, keys): """See VersionedFiles.get_sha1s().""" result = {} for record in self.get_record_stream(keys, "unordered", True): if record.sha1 is not None: result[record.key] = record.sha1 else: if record.storage_kind != "absent": result[record.key] = sha_strings(record.iter_bytes_as("chunked")) return result def insert_record_stream(self, stream): """Insert a record stream into this container. :param stream: A stream of records to insert. :return: None :seealso VersionedFiles.get_record_stream: """ # XXX: Setting random_id=True makes # test_insert_record_stream_existing_keys fail for groupcompress and # groupcompress-nograph, this needs to be revisited while addressing # 'bzr branch' performance issues. for _, _ in self._insert_record_stream(stream, random_id=False): pass def _get_compressor_settings(self): if self._max_bytes_to_index is None: self._max_bytes_to_index = self._DEFAULT_MAX_BYTES_TO_INDEX return {"max_bytes_to_index": self._max_bytes_to_index} def _make_group_compressor(self): return GroupCompressor(self._get_compressor_settings()) def _insert_record_stream( self, stream, random_id=False, nostore_sha=None, reuse_blocks=True ): """Internal core to insert a record stream into this container. This helper function has a different interface than insert_record_stream to allow add_lines to be minimal, but still return the needed data. :param stream: A stream of records to insert. 
:param nostore_sha: If the sha1 of a given text matches nostore_sha, raise ExistingContent, rather than committing the new text. :param reuse_blocks: If the source is streaming from groupcompress-blocks, just insert the blocks as-is, rather than expanding the texts and inserting again. :return: An iterator over (sha1, length) of the inserted records. :seealso insert_record_stream: :seealso add_lines: """ adapters = {} def get_adapter(adapter_key): try: return adapters[adapter_key] except KeyError: adapter_factory = adapter_registry.get(adapter_key) adapter = adapter_factory(self) adapters[adapter_key] = adapter return adapter # This will go up to fulltexts for gc to gc fetching, which isn't # ideal. self._compressor = self._make_group_compressor() self._unadded_refs = {} keys_to_add = [] def flush(block): bytes_len, chunks = block.to_chunks() self._compressor = self._make_group_compressor() # Note: At this point we still have 1 copy of the fulltext (in # record and the var 'bytes'), and this generates 2 copies of # the compressed text (one for bytes, one in chunks) # TODO: Figure out how to indicate that we would be happy to free # the fulltext content at this point. Note that sometimes we # will want it later (streaming CHK pages), but most of the # time we won't (everything else) _index, start, length = self._access.add_raw_record(None, bytes_len, chunks) nodes = [] for key, reads, refs in keys_to_add: nodes.append((key, b"%d %d %s" % (start, length, reads), refs)) self._index.add_records(nodes, random_id=random_id) self._unadded_refs = {} del keys_to_add[:] last_prefix = None max_fulltext_len = 0 max_fulltext_prefix = None insert_manager = None block_start = None block_length = None # XXX: TODO: remove this, it is just for safety checking for now inserted_keys = set() reuse_this_block = reuse_blocks for record in stream: # Raise an error when a record is missing. 
if record.storage_kind == "absent": raise RevisionNotPresent(record.key, self) if random_id: if record.key in inserted_keys: logger.info( "Insert claimed random_id=True, but then inserted %r two times", record.key, ) continue inserted_keys.add(record.key) if reuse_blocks: # If the reuse_blocks flag is set, check to see if we can just # copy a groupcompress block as-is. # We only check on the first record (groupcompress-block) not # on all of the (groupcompress-block-ref) entries. # The reuse_this_block flag is then kept for as long as if record.storage_kind == "groupcompress-block": # Check to see if we really want to re-use this block insert_manager = record._manager reuse_this_block = insert_manager.check_is_well_utilized() else: reuse_this_block = False if reuse_this_block: # We still want to reuse this block if record.storage_kind == "groupcompress-block": # Insert the raw block into the target repo insert_manager = record._manager bytes_len, chunks = record._manager._block.to_chunks() _, start, length = self._access.add_raw_record( None, bytes_len, chunks ) block_start = start block_length = length if record.storage_kind in ( "groupcompress-block", "groupcompress-block-ref", ): if insert_manager is None: raise AssertionError("No insert_manager set") if insert_manager is not record._manager: raise AssertionError( "insert_manager does not match" " the current record, we cannot be positive" " that the appropriate content was inserted." 
) value = b"%d %d %d %d" % ( block_start, block_length, record._start, record._end, ) nodes = [(record.key, value, (record.parents,))] # TODO: Consider buffering up many nodes to be added, not # sure how much overhead this has, but we're seeing # ~23s / 120s in add_records calls self._index.add_records(nodes, random_id=random_id) continue try: chunks = record.get_bytes_as("chunked") except UnavailableRepresentation: adapter_key = record.storage_kind, "chunked" adapter = get_adapter(adapter_key) chunks = adapter.get_bytes(record, "chunked") chunks_len = record.size if chunks_len is None: chunks_len = sum(map(len, chunks)) if len(record.key) > 1: prefix = record.key[0] soft = prefix == last_prefix else: prefix = None soft = False if max_fulltext_len < chunks_len: max_fulltext_len = chunks_len max_fulltext_prefix = prefix (found_sha1, start_point, end_point, _type) = self._compressor.compress( record.key, chunks, chunks_len, record.sha1, soft=soft, nostore_sha=nostore_sha, ) # delta_ratio = float(chunks_len) / (end_point - start_point) # Check if we want to continue to include that text if prefix == max_fulltext_prefix and end_point < 2 * max_fulltext_len: # As long as we are on the same file_id, we will fill at least # 2 * max_fulltext_len start_new_block = False elif end_point > 4 * 1024 * 1024: start_new_block = True elif ( prefix is not None and prefix != last_prefix and end_point > 2 * 1024 * 1024 ): start_new_block = True else: start_new_block = False last_prefix = prefix if start_new_block: flush(self._compressor.flush_without_last()) max_fulltext_len = chunks_len (found_sha1, start_point, end_point, _type) = self._compressor.compress( record.key, chunks, chunks_len, record.sha1 ) if record.key[-1] is None: key = record.key[:-1] + (b"sha1:" + found_sha1,) else: key = record.key self._unadded_refs[key] = record.parents yield found_sha1, chunks_len if record.parents is not None: parents = tuple([tuple(p) for p in record.parents]) else: parents = None refs = 
(parents,) keys_to_add.append((key, b"%d %d" % (start_point, end_point), refs)) if len(keys_to_add): flush(self._compressor.flush()) self._compressor = None def iter_lines_added_or_present_in_keys(self, keys, pb=None): r"""Iterate over the lines in the versioned files from keys. This may return lines from other keys. Each item the returned iterator yields is a tuple of a line and a text version that that line is present in (not introduced in). Ordering of results is in whatever order is most suitable for the underlying storage format. If a progress bar is supplied, it may be used to indicate progress. The caller is responsible for cleaning up progress bars (because this is an iterator). Notes: * Lines are normalised by the underlying store: they will all have \n terminators. * Lines are returned in arbitrary order. :return: An iterator over (line, key). """ keys = set(keys) total = len(keys) # we don't care about inclusions, the caller cares. # but we need to setup a list of records to visit. # we need key, position, length for key_idx, record in enumerate( self.get_record_stream(keys, "unordered", True) ): # XXX: todo - optimise to use less than full texts. key = record.key if pb is not None: pb.update("Walking content", key_idx, total) if record.storage_kind == "absent": raise RevisionNotPresent(key, self) for line in record.iter_bytes_as("lines"): yield line, key if pb is not None: pb.update("Walking content", total, total) def keys(self): """See VersionedFiles.keys.""" evil_logger.debug("keys scales with size of history") sources = [self._index] + self._immediate_fallback_vfs result = set() for source in sources: result.update(source.keys()) return result class _GCBuildDetails: """A blob of data about the build details. This stores the minimal data, which then allows compatibility with the old api, without taking as much memory. 
""" __slots__ = ( "_basis_end", "_delta_end", "_group_end", "_group_start", "_index", "_parents", ) method = "group" compression_parent = None def __init__(self, parents, position_info): self._parents = parents ( self._index, self._group_start, self._group_end, self._basis_end, self._delta_end, ) = position_info def __repr__(self): return f"{self.__class__.__name__}({self.index_memo}, {self._parents})" @property def index_memo(self): return ( self._index, self._group_start, self._group_end, self._basis_end, self._delta_end, ) @property def record_details(self): return (self.method, None) def __getitem__(self, offset): """Compatibility thunk to act like a tuple.""" if offset == 0: return self.index_memo elif offset == 1: return self.compression_parent # Always None elif offset == 2: return self._parents elif offset == 3: return self.record_details else: raise IndexError("offset out of range") def __len__(self): return 4 class _GCGraphIndex: """Mapper from GroupCompressVersionedFiles needs into GraphIndex storage.""" def __init__( self, graph_index, is_locked, parents=True, add_callback=None, track_external_parent_refs=False, inconsistency_fatal=True, track_new_keys=False, ): """Construct a _GCGraphIndex on a graph_index. :param graph_index: An implementation of bzrformats.index.GraphIndex. :param is_locked: A callback, returns True if the index is locked and thus usable. :param parents: If True, record knits parents, if not do not record parents. :param add_callback: If not None, allow additions to the index and call this callback with a list of added GraphIndex nodes: [(node, value, node_refs), ...] :param track_external_parent_refs: As keys are added, keep track of the keys they reference, so that we can query get_missing_parents(), etc. :param inconsistency_fatal: When asked to add records that are already present, and the details are inconsistent with the existing record, raise an exception instead of warning (and skipping the record). 
""" self._add_callback = add_callback self._graph_index = graph_index self._parents = parents self.has_graph = parents self._is_locked = is_locked self._inconsistency_fatal = inconsistency_fatal # GroupCompress records tend to have the same 'group' start + offset # repeated over and over, this creates a surplus of ints self._int_cache = {} if track_external_parent_refs: self._key_dependencies = _KeyRefs(track_new_keys=track_new_keys) else: self._key_dependencies = None def add_records(self, records, random_id=False): """Add multiple records to the index. This function does not insert data into the Immutable GraphIndex backing the KnitGraphIndex, instead it prepares data for insertion by the caller and checks that it is safe to insert then calls self._add_callback with the prepared GraphIndex nodes. :param records: a list of tuples: (key, options, access_memo, parents). :param random_id: If True the ids being added were randomly generated and no check for existence will be performed. """ if not self._add_callback: raise ReadOnlyError(self) # we hope there are no repositories with inconsistent parentage # anymore. changed = False keys = {} for key, value, refs in records: if not self._parents and refs: for ref in refs: if ref: from . import knit raise knit.KnitCorrupt( self, "attempt to add node with parents in parentless index.", ) refs = () changed = True keys[key] = (value, refs) # check for dups if not random_id: present_nodes = self._get_entries(keys) for _index, key, value, node_refs in present_nodes: # Sometimes these are passed as a list rather than a tuple node_refs = as_tuples(node_refs) passed = as_tuples(keys[key]) if node_refs != passed[1]: details = f"{key} {value, node_refs} {passed}" if self._inconsistency_fatal: from . 
import knit

                        raise knit.KnitCorrupt(
                            self,
                            "inconsistent details in add_records: {}".format(details),
                        )
                    else:
                        logger.warning(
                            "inconsistent details in skipped record: %s", details
                        )
                del keys[key]
                changed = True
        if changed:
            result = []
            if self._parents:
                for key, (value, node_refs) in keys.items():
                    result.append((key, value, node_refs))
            else:
                for key, (value, node_refs) in keys.items():  # noqa: B007
                    result.append((key, value))
            records = result
        key_dependencies = self._key_dependencies
        if key_dependencies is not None:
            if self._parents:
                for key, value, refs in records:  # noqa: B007
                    parents = refs[0]
                    key_dependencies.add_references(key, parents)
            else:
                # No parent refs to record, so just register each key.
                # Records may be (key, value) or (key, value, refs) tuples at
                # this point, so only the first element is read.
                for record in records:
                    key_dependencies.add_key(record[0])
        self._add_callback(records)

    def _check_read(self):
        """Raise an exception if reads are not permitted."""
        if not self._is_locked():
            raise ObjectNotLocked(self)

    def _check_write_ok(self):
        """Raise an exception if writes are not permitted."""
        if not self._is_locked():
            raise ObjectNotLocked(self)

    def _get_entries(self, keys, check_present=False):
        """Get the entries for keys.

        Note: Callers are responsible for checking that the index is locked
        before calling this method.

        :param keys: An iterable of index key tuples.
        """
        keys = set(keys)
        found_keys = set()
        if self._parents:
            for node in self._graph_index.iter_entries(keys):
                yield node
                found_keys.add(node[1])
        else:
            # adapt parentless index to the rest of the code.
            for node in self._graph_index.iter_entries(keys):
                yield node[0], node[1], node[2], ()
                found_keys.add(node[1])
        if check_present:
            missing_keys = keys.difference(found_keys)
            if missing_keys:
                raise RevisionNotPresent(missing_keys.pop(), self)

    def find_ancestry(self, keys):
        """See CombinedGraphIndex.find_ancestry."""
        return self._graph_index.find_ancestry(keys, 0)

    def get_parent_map(self, keys):
        """Get a map of the parents of keys.

        :param keys: The keys to look up parents for.
        :return: A mapping from keys to parents. Absent keys are absent from
            the mapping.
""" self._check_read() nodes = self._get_entries(keys) result = {} if self._parents: for node in nodes: result[node[1]] = node[3][0] else: for node in nodes: result[node[1]] = None return result def get_missing_parents(self): """Return the keys of missing parents.""" # Copied from _KnitGraphIndex.get_missing_parents # We may have false positives, so filter those out. self._key_dependencies.satisfy_refs_for_keys( self.get_parent_map(self._key_dependencies.get_unsatisfied_refs()) ) return frozenset(self._key_dependencies.get_unsatisfied_refs()) def get_build_details(self, keys): """Get the various build details for keys. Ghosts are omitted from the result. :param keys: An iterable of keys. :return: A dict of key: (index_memo, compression_parent, parents, record_details). * index_memo: opaque structure to pass to read_records to extract the raw data * compression_parent: Content that this record is built upon, may be None * parents: Logical parents of this node * record_details: extra information about the content which needs to be passed to Factory.parse_record """ self._check_read() result = {} entries = self._get_entries(keys) for entry in entries: key = entry[1] parents = None if not self._parents else entry[3][0] details = _GCBuildDetails(parents, self._node_to_position(entry)) result[key] = details return result def keys(self): """Get all the keys in the collection. The keys are not ordered. """ self._check_read() return [node[1] for node in self._graph_index.iter_all_entries()] def _node_to_position(self, node): """Convert an index value to position details.""" bits = node[2].split(b" ") # It would be nice not to read the entire gzip. # start and stop are put into _int_cache because they are very common. # They define the 'group' that an entry is in, and many groups can have # thousands of objects. # Branching Launchpad, for example, saves ~600k integers, at 12 bytes # each, or about 7MB. 
Note that it might be even more when you consider # how PyInt is allocated in separate slabs. And you can't return a slab # to the OS if even 1 int on it is in use. Note though that Python uses # a LIFO when re-using PyInt slots, which might cause more # fragmentation. start = int(bits[0]) start = self._int_cache.setdefault(start, start) stop = int(bits[1]) stop = self._int_cache.setdefault(stop, stop) basis_end = int(bits[2]) delta_end = int(bits[3]) # We can't use tuple here, because node[0] is a BTreeGraphIndex # instance... return (node[0], start, stop, basis_end, delta_end) def scan_unvalidated_index(self, graph_index): """Inform this _GCGraphIndex that there is an unvalidated index. This allows this _GCGraphIndex to keep track of any missing compression parents we may want to have filled in to make those indices valid. It also allows _GCGraphIndex to track any new keys. :param graph_index: A GraphIndex """ key_dependencies = self._key_dependencies if key_dependencies is None: return for node in graph_index.iter_all_entries(): # Add parent refs from graph_index (and discard parent refs # that the graph_index has). 
            key_dependencies.add_references(node[1], node[3][0])


from ._bzr_rs import groupcompress

encode_base128_int = groupcompress.encode_base128_int
encode_copy_instruction = groupcompress.encode_copy_instruction
LinesDeltaIndex = groupcompress.LinesDeltaIndex
make_line_delta = groupcompress.make_line_delta
make_rabin_delta = groupcompress.make_rabin_delta
apply_delta = groupcompress.apply_delta
apply_delta_to_source = groupcompress.apply_delta_to_source
decode_base128_int = groupcompress.decode_base128_int
decode_copy_instruction = groupcompress.decode_copy_instruction

try:
    from ._groupcompress_pyx import DeltaIndex

    GroupCompressor = PyrexGroupCompressor  # type: ignore
except ModuleNotFoundError as e:
    osutils.failed_to_load_extension(e)
    GroupCompressor = PythonGroupCompressor  # type: ignore
bzrformats_3.4.0.orig/bzrformats/index.py0000644000000000000000000023634015162115103015533 0ustar00
# Copyright (C) 2007-2011 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

"""Indexing facilities."""

__all__ = [
    "CombinedGraphIndex",
    "GraphIndex",
    "GraphIndexBuilder",
    "GraphIndexPrefixAdapter",
    "InMemoryGraphIndex",
]

import logging
import re
from bisect import bisect_right
from io import BytesIO

from .
import revision as _mod_revision from .errors import BzrFormatsError from .transport import TransportNoSuchFile logger = logging.getLogger("bzrformats.index") evil_logger = logging.getLogger("bzrformats.evil") _HEADER_READV = (0, 200) _OPTION_KEY_ELEMENTS = b"key_elements=" _OPTION_LEN = b"len=" _OPTION_NODE_REFS = b"node_ref_lists=" _SIGNATURE = b"Bazaar Graph Index 1\n" class BadIndexFormatSignature(BzrFormatsError): _fmt = "%(value)s is not an index of type %(_type)s." def __init__(self, value, _type): super().__init__() self.value = value self._type = _type class BadIndexData(BzrFormatsError): _fmt = "Error in data for index %(value)s." def __init__(self, value): super().__init__() self.value = value class BadIndexDuplicateKey(BzrFormatsError): _fmt = "The key '%(key)s' is already in index '%(index)s'." def __init__(self, key, index): super().__init__() self.key = key self.index = index class BadIndexKey(BzrFormatsError): _fmt = "The key '%(key)s' is not a valid key." def __init__(self, key): super().__init__() self.key = key class BadIndexOptions(BzrFormatsError): _fmt = "Could not parse options for index %(value)s." def __init__(self, value): super().__init__() self.value = value class BadIndexValue(BzrFormatsError): _fmt = "The value '%(value)s' is not a valid value." def __init__(self, value): super().__init__() self.value = value _whitespace_re = re.compile(b"[\t\n\x0b\x0c\r\x00 ]") _newline_null_re = re.compile(b"[\n\0]") def _has_key_from_parent_map(self, key): """Check if this index has one key. If it's possible to check for multiple keys at once through calling get_parent_map that should be faster. """ return key in self.get_parent_map([key]) def _missing_keys_from_parent_map(self, keys): return set(keys) - set(self.get_parent_map(keys)) class GraphIndexBuilder: """A builder that can build a GraphIndex. 
    The resulting graph has the structure::

      _SIGNATURE OPTIONS NODES NEWLINE
      _SIGNATURE     := 'Bazaar Graph Index 1' NEWLINE
      OPTIONS        := 'node_ref_lists=' DIGITS NEWLINE
      NODES          := NODE*
      NODE           := KEY NULL ABSENT? NULL REFERENCES NULL VALUE NEWLINE
      KEY            := Not-whitespace-utf8
      ABSENT         := 'a'
      REFERENCES     := REFERENCE_LIST (TAB REFERENCE_LIST){node_ref_lists - 1}
      REFERENCE_LIST := (REFERENCE (CR REFERENCE)*)?
      REFERENCE      := DIGITS  ; digits is the byte offset in the index of the
                                ; referenced key.
      VALUE          := no-newline-no-null-bytes
    """

    def __init__(self, reference_lists=0, key_elements=1):
        """Create a GraphIndex builder.

        :param reference_lists: The number of node references lists for each
            entry.
        :param key_elements: The number of bytestrings in each key.
        """
        self.reference_lists = reference_lists
        # A dict of {key: (absent, ref_lists, value)}
        self._nodes = {}
        # Keys that are referenced but not actually present in this index
        self._absent_keys = set()
        self._nodes_by_key = None
        self._key_length = key_elements
        self._optimize_for_size = False
        self._combine_backing_indices = True

    def _check_key(self, key):
        """Raise BadIndexKey if key is not a valid key for this index."""
        if type(key) not in (tuple,):
            raise BadIndexKey(key)
        if self._key_length != len(key):
            raise BadIndexKey(key)
        for element in key:
            if (
                not element
                or not isinstance(element, bytes)
                or _whitespace_re.search(element) is not None
            ):
                raise BadIndexKey(key)

    def _external_references(self):
        """Return references that are not present in this index."""
        keys = set()
        refs = set()
        # TODO: JAM 2008-11-21 This makes an assumption about how the reference
        #       lists are used. It is currently correct for pack-0.92 through
        #       1.9, which use the node references (3rd column) second
        #       reference list as the compression parent. Perhaps this should
        #       be moved into something higher up the stack, since it
        #       makes assumptions about how the index is used.
if self.reference_lists > 1: for node in self.iter_all_entries(): keys.add(node[1]) refs.update(node[3][1]) return refs - keys else: # If reference_lists == 0 there can be no external references, and # if reference_lists == 1, then there isn't a place to store the # compression parent return set() def _get_nodes_by_key(self): if self._nodes_by_key is None: nodes_by_key = {} if self.reference_lists: for key, (absent, references, value) in self._nodes.items(): if absent: continue key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value, references else: for key, (absent, _references, value) in self._nodes.items(): if absent: continue key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value self._nodes_by_key = nodes_by_key return self._nodes_by_key def _update_nodes_by_key(self, key, value, node_refs): """Update the _nodes_by_key dict with a new key. For a key of (foo, bar, baz) create _nodes_by_key[foo][bar][baz] = key_value """ if self._nodes_by_key is None: return key_dict = self._nodes_by_key if self.reference_lists: key_value = (key, value, node_refs) else: key_value = (key, value) for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key_value def _check_key_ref_value(self, key, references, value): """Check that 'key' and 'references' are all valid. :param key: A key tuple. Must conform to the key interface (be a tuple, be of the right length, not have any whitespace or nulls in any key element.) :param references: An iterable of reference lists. Something like [[(ref, key)], [(ref, key), (other, key)]] :param value: The value associated with this key. Must not contain newlines or null characters. :return: (node_refs, absent_references) * node_refs: basically a packed form of 'references' where all iterables are tuples * absent_references: reference keys that are not in self._nodes.
This may contain duplicates if the same key is referenced in multiple lists. """ self._check_key(key) if _newline_null_re.search(value) is not None: raise BadIndexValue(value) if len(references) != self.reference_lists: raise BadIndexValue(references) node_refs = [] absent_references = [] for reference_list in references: for reference in reference_list: # If reference *is* in self._nodes, then we know it has already # been checked. if reference not in self._nodes: self._check_key(reference) absent_references.append(reference) reference_list = tuple([tuple(ref) for ref in reference_list]) node_refs.append(reference_list) return tuple(node_refs), absent_references def add_node(self, key, value, references=()): r"""Add a node to the index. :param key: The key. Keys are non-empty tuples containing as many whitespace-free utf8 bytestrings as the key length defined for this index. :param references: An iterable of iterables of keys. Each is a reference to another key. :param value: The value to associate with the key. It may be any bytes as long as it does not contain \0 or \n. """ (node_refs, absent_references) = self._check_key_ref_value( key, references, value ) if key in self._nodes and self._nodes[key][0] != b"a": raise BadIndexDuplicateKey(key, self) for reference in absent_references: # There may be duplicates, but I don't think it is worth worrying # about self._nodes[reference] = (b"a", (), b"") self._absent_keys.update(absent_references) self._absent_keys.discard(key) self._nodes[key] = (b"", node_refs, value) if self._nodes_by_key is not None and self._key_length > 1: self._update_nodes_by_key(key, value, node_refs) def clear_cache(self): """See GraphIndex.clear_cache(). This is a no-op, but we need the api to conform to a generic 'Index' abstraction. """ def finish(self): """Finish the index. :returns: BytesIO holding the full content of the index as it should be written to disk.
""" lines = [_SIGNATURE] lines.append(b"%s%d\n" % (_OPTION_NODE_REFS, self.reference_lists)) lines.append(b"%s%d\n" % (_OPTION_KEY_ELEMENTS, self._key_length)) key_count = len(self._nodes) - len(self._absent_keys) lines.append(b"%s%d\n" % (_OPTION_LEN, key_count)) prefix_length = sum(len(x) for x in lines) # references are byte offsets. To avoid having to do nasty # polynomial work to resolve offsets (references to later in the # file cannot be determined until all the inbetween references have # been calculated too) we pad the offsets with 0's to make them be # of consistent length. Using binary offsets would break the trivial # file parsing. # to calculate the width of zero's needed we do three passes: # one to gather all the non-reference data and the number of references. # one to pad all the data with reference-length and determine entry # addresses. # One to serialise. # forward sorted by key. In future we may consider topological sorting, # at the cost of table scans for direct lookup, or a second index for # direct lookup nodes = sorted(self._nodes.items()) # if we do not prepass, we don't know how long it will be up front. expected_bytes = None # we only need to pre-pass if we have reference lists at all. if self.reference_lists: key_offset_info = [] non_ref_bytes = prefix_length total_references = 0 # TODO use simple multiplication for the constants in this loop. for key, (absent, references, value) in nodes: # record the offset known *so far* for this key: # the non reference bytes to date, and the total references to # date - saves reaccumulating on the second pass key_offset_info.append((key, non_ref_bytes, total_references)) # key is literal, value is literal, there are 3 null's, 1 NL # key is variable length tuple, \x00 between elements non_ref_bytes += sum(len(element) for element in key) if self._key_length > 1: non_ref_bytes += self._key_length - 1 # value is literal bytes, there are 3 null's, 1 NL. 
non_ref_bytes += len(value) + 3 + 1 # one byte for absent if set. if absent: non_ref_bytes += 1 elif self.reference_lists: # (ref_lists -1) tabs non_ref_bytes += self.reference_lists - 1 # (ref-1 cr's per ref_list) for ref_list in references: # how many references across the whole file? total_references += len(ref_list) # accrue reference separators if ref_list: non_ref_bytes += len(ref_list) - 1 # how many digits are needed to represent the total byte count? digits = 1 possible_total_bytes = non_ref_bytes + total_references * digits while 10**digits < possible_total_bytes: digits += 1 possible_total_bytes = non_ref_bytes + total_references * digits expected_bytes = possible_total_bytes + 1 # terminating newline # resolve key addresses. key_addresses = {} for key, non_ref_bytes, total_references in key_offset_info: key_addresses[key] = non_ref_bytes + total_references * digits # serialise format_string = b"%%0%dd" % digits for key, (absent, references, value) in nodes: flattened_references = [] for ref_list in references: ref_addresses = [] for reference in ref_list: ref_addresses.append(format_string % key_addresses[reference]) flattened_references.append(b"\r".join(ref_addresses)) string_key = b"\x00".join(key) lines.append( b"%s\x00%s\x00%s\x00%s\n" % (string_key, absent, b"\t".join(flattened_references), value) ) lines.append(b"\n") result = BytesIO(b"".join(lines)) if expected_bytes and len(result.getvalue()) != expected_bytes: raise errors.BzrError( "Failed index creation. Internal error:" " mismatched output length and expected length: %d %d" % (len(result.getvalue()), expected_bytes) ) return result def set_optimize(self, for_size=None, combine_backing_indices=None): """Change how the builder tries to optimize the result. :param for_size: Tell the builder to try and make the index as small as possible. :param combine_backing_indices: If the builder spills to disk to save memory, should the on-disk indices be combined. 
Set to True if you are going to be probing the index, but to False if you are not. (If you are not querying, then the time spent combining is wasted.) :return: None """ # GraphIndexBuilder itself doesn't pay attention to the flag yet, but # other builders do. if for_size is not None: self._optimize_for_size = for_size if combine_backing_indices is not None: self._combine_backing_indices = combine_backing_indices def find_ancestry(self, keys, ref_list_num): """See CombinedGraphIndex.find_ancestry().""" pending = set(keys) parent_map = {} missing_keys = set() while pending: next_pending = set() for _, key, _value, ref_lists in self.iter_entries(pending): parent_keys = ref_lists[ref_list_num] parent_map[key] = parent_keys next_pending.update([p for p in parent_keys if p not in parent_map]) missing_keys.update(pending.difference(parent_map)) pending = next_pending return parent_map, missing_keys class GraphIndex: """An index for data with embedded graphs. The index maps keys to a list of key reference lists, and a value. Each node has the same number of key reference lists. Each key reference list can be empty or an arbitrary length. The value is an opaque NULL terminated string without any newlines. The storage of the index is hidden in the interface: keys and key references are always tuples of bytestrings, never the internal representation (e.g. dictionary offsets). It is presumed that the index will not be mutated - it is static data. Successive iter_all_entries calls will read the entire index each time. Additionally, iter_entries calls will read the index linearly until the desired keys are found. XXX: This must be fixed before the index is suitable for production use. :XXX """ def __init__(self, transport, name, size, unlimited_cache=False, offset=0): """Open an index called name on transport. :param transport: A Transport. :param name: A path to provide to transport API calls. :param size: The size of the index in bytes. 
This is used for bisection logic to perform partial index reads. While the size could be obtained by statting the file, doing so introduces an additional round trip as well as requiring stat'able transports, both of which are avoided by having it supplied. If size is None, then bisection support will be disabled and accessing the index will just stream all the data. :param offset: Instead of starting the index data at offset 0, start it at an arbitrary offset. """ self._transport = transport self._name = name # Becomes a dict of key:(value, reference-list-byte-locations) used by # the bisection interface to store parsed but not resolved keys. self._bisect_nodes = None # Becomes a dict of key:(value, reference-list-keys) which are ready to # be returned directly to callers. self._nodes = None # a sorted list of slice-addresses for the parsed bytes of the file. # e.g. (0,1) would mean that byte 0 is parsed. self._parsed_byte_map = [] # a sorted list of keys matching each slice address for parsed bytes # e.g. (None, 'foo@bar') would mean that the first byte contained no # key, and the end byte of the slice is the end of the data for 'foo@bar' self._parsed_key_map = [] self._key_count = None self._keys_by_offset = None self._nodes_by_key = None self._size = size # The number of bytes we've read so far in trying to process this file self._bytes_read = 0 self._base_offset = offset def __eq__(self, other): """Equal when self and other were created with the same parameters.""" return ( isinstance(self, type(other)) and self._transport == other._transport and self._name == other._name and self._size == other._size ) def __ne__(self, other): """Return True if self != other.""" return not self.__eq__(other) def __lt__(self, other): """Return True if self < other for ordering purposes.""" # We don't really care about the order, just that there is an order.
if not isinstance(other, GraphIndex) and not isinstance( other, InMemoryGraphIndex ): raise TypeError(other) return hash(self) < hash(other) def __hash__(self): """Return hash value for the graph index.""" return hash((type(self), self._transport, self._name, self._size)) def __repr__(self): """Return string representation of the graph index.""" return f"{self.__class__.__name__}({self._transport.abspath(self._name)!r})" def _buffer_all(self, stream=None): """Buffer all the index data. Mutates self._nodes and self._keys_by_offset. """ if self._nodes is not None: # We already did this return logger.debug("Reading entire index %s", self._transport.abspath(self._name)) if stream is None: stream = self._transport.get(self._name) if self._base_offset != 0 or not hasattr(stream, "readline"): # This is wasteful, but it is better than dealing with # adjusting all the offsets, etc. stream = BytesIO(stream.read()[self._base_offset :]) try: self._read_prefix(stream) self._expected_elements = 3 + self._key_length # raw data keyed by offset self._keys_by_offset = {} # ready-to-return key:value or key:value, node_ref_lists self._nodes = {} self._nodes_by_key = None trailers = 0 pos = stream.tell() lines = stream.read().split(b"\n") finally: stream.close() del lines[-1] _, _, _, trailers = self._parse_lines(lines, pos) for key, absent, references, value in self._keys_by_offset.values(): if absent: continue # resolve references: if self.node_ref_lists: node_value = (value, self._resolve_references(references)) else: node_value = value self._nodes[key] = node_value # cache the keys for quick set intersections if trailers != 1: # there must be one line - the empty trailer line. raise BadIndexData(self) def clear_cache(self): """Clear out any cached/memoized values. This can be called at any time, but generally it is used when we have extracted some information, but don't expect to be requesting any more from this index.
""" def external_references(self, ref_list_num): """Return references that are not present in this index.""" self._buffer_all() if ref_list_num + 1 > self.node_ref_lists: raise ValueError( "No ref list %d, index has %d ref lists" % (ref_list_num, self.node_ref_lists) ) refs = set() nodes = self._nodes for _key, (_value, ref_lists) in nodes.items(): ref_list = ref_lists[ref_list_num] refs.update([ref for ref in ref_list if ref not in nodes]) return refs def _get_nodes_by_key(self): if self._nodes_by_key is None: nodes_by_key = {} if self.node_ref_lists: for key, (value, references) in self._nodes.items(): key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value, references else: for key, value in self._nodes.items(): key_dict = nodes_by_key for subkey in key[:-1]: key_dict = key_dict.setdefault(subkey, {}) key_dict[key[-1]] = key, value self._nodes_by_key = nodes_by_key return self._nodes_by_key def iter_all_entries(self): """Iterate over all keys within the index. :return: An iterable of (index, key, value) or (index, key, value, reference_lists). The former tuple is used when there are no reference lists in the index, making the API compatible with simple key:value index types. There is no defined order for the result iteration - it will be in the most efficient order for the index. 
""" evil_logger.debug("iter_all_entries scales with size of history.") if self._nodes is None: self._buffer_all() if self.node_ref_lists: for key, (value, node_ref_lists) in self._nodes.items(): yield self, key, value, node_ref_lists else: for key, value in self._nodes.items(): yield self, key, value def _read_prefix(self, stream): signature = stream.read(len(self._signature())) if not signature == self._signature(): raise BadIndexFormatSignature(self._name, GraphIndex) options_line = stream.readline() if not options_line.startswith(_OPTION_NODE_REFS): raise BadIndexOptions(self) try: self.node_ref_lists = int(options_line[len(_OPTION_NODE_REFS) : -1]) except ValueError as e: raise BadIndexOptions(self) from e options_line = stream.readline() if not options_line.startswith(_OPTION_KEY_ELEMENTS): raise BadIndexOptions(self) try: self._key_length = int(options_line[len(_OPTION_KEY_ELEMENTS) : -1]) except ValueError as e: raise BadIndexOptions(self) from e options_line = stream.readline() if not options_line.startswith(_OPTION_LEN): raise BadIndexOptions(self) try: self._key_count = int(options_line[len(_OPTION_LEN) : -1]) except ValueError as e: raise BadIndexOptions(self) from e def _resolve_references(self, references): """Return the resolved key references for references. References are resolved by looking up the location of the key in the _keys_by_offset map and substituting the key name, preserving ordering. :param references: An iterable of iterables of key locations. e.g. [[123, 456], [123]] :return: A tuple of tuples of keys. """ node_refs = [] for ref_list in references: node_refs.append(tuple([self._keys_by_offset[ref][0] for ref in ref_list])) return tuple(node_refs) @staticmethod def _find_index(range_map, key): """Helper for the _parsed_*_index calls. Given a range map - [(start, end), ...], finds the index of the range in the map for key if it is in the map, and if it is not there, the immediately preceeding range in the map. 
""" result = bisect_right(range_map, key) - 1 if result + 1 < len(range_map): # check the border condition, it may be in result + 1 if range_map[result + 1][0] == key[0]: return result + 1 return result def _parsed_byte_index(self, offset): """Return the index of the entry immediately before offset. e.g. if the parsed map has regions 0,10 and 11,12 parsed, meaning that there is one unparsed byte (the 11th, addressed as[10]). then: asking for 0 will return 0 asking for 10 will return 0 asking for 11 will return 1 asking for 12 will return 1 """ key = (offset, 0) return self._find_index(self._parsed_byte_map, key) def _parsed_key_index(self, key): """Return the index of the entry immediately before key. e.g. if the parsed map has regions (None, 'a') and ('b','c') parsed, meaning that keys from None to 'a' inclusive, and 'b' to 'c' inclusive have been parsed, then: asking for '' will return 0 asking for 'a' will return 0 asking for 'b' will return 1 asking for 'e' will return 1 """ search_key = (key, b"") return self._find_index(self._parsed_key_map, search_key) def _is_parsed(self, offset): """Returns True if offset has been parsed.""" index = self._parsed_byte_index(offset) if index == len(self._parsed_byte_map): return offset < self._parsed_byte_map[index - 1][1] start, end = self._parsed_byte_map[index] return offset >= start and offset < end def _iter_entries_from_total_buffer(self, keys): """Iterate over keys when the entire index is parsed.""" # Note: See the note in BTreeBuilder.iter_entries for why we don't use # .intersection() here nodes = self._nodes keys = [key for key in keys if key in nodes] if self.node_ref_lists: for key in keys: value, node_refs = nodes[key] yield self, key, value, node_refs else: for key in keys: yield self, key, nodes[key] def iter_entries(self, keys): """Iterate over keys within the index. :param keys: An iterable providing the keys to be retrieved. :return: An iterable as per iter_all_entries, but restricted to the keys supplied. 
No additional keys will be returned, and every key supplied that is in the index will be returned. """ from . import bisect_multi keys = set(keys) if not keys: return [] if self._size is None and self._nodes is None: self._buffer_all() # We fit about 20 keys per minimum-read (4K), so if we are looking for # more than 1/20th of the index it's likely (assuming homogeneous key # spread) that we'll read the entire index. If we're going to do that, # buffer the whole thing. A better analysis might take key spread into # account - but B+Tree indices are better anyway. # We could look at all data read, and use a threshold there, which will # trigger on ancestry walks, but that is not yet fully mapped out. if self._nodes is None and len(keys) * 20 > self.key_count(): self._buffer_all() if self._nodes is not None: return self._iter_entries_from_total_buffer(keys) else: return ( result[1] for result in bisect_multi.bisect_multi_bytes( self._lookup_keys_via_location, self._size, keys ) ) def iter_entries_prefix(self, keys): """Iterate over keys within the index using prefix matching. Prefix matching is applied within the tuple of a key, not within the bytestring of each key element. e.g. if you have the keys ('foo', 'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then only the former key is returned. WARNING: Note that this method currently causes a full index parse unconditionally (which is reasonably appropriate as it is a means for thunking many small indices into one larger one and still supplies iter_all_entries at the thunk layer). :param keys: An iterable providing the key prefixes to be retrieved. Each key prefix takes the form of a tuple the length of a key, but with the last N elements 'None' rather than a regular bytestring. The first element cannot be 'None'. :return: An iterable as per iter_all_entries, but restricted to the keys with a matching prefix to those supplied.
No additional keys will be returned, and every match that is in the index will be returned. """ keys = set(keys) if not keys: return # load data - also finds key lengths if self._nodes is None: self._buffer_all() if self._key_length == 1: for key in keys: _sanity_check_key(self, key) if self.node_ref_lists: value, node_refs = self._nodes[key] yield self, key, value, node_refs else: yield self, key, self._nodes[key] return nodes_by_key = self._get_nodes_by_key() yield from _iter_entries_prefix(self, nodes_by_key, keys) def _find_ancestors(self, keys, ref_list_num, parent_map, missing_keys): """See BTreeIndex._find_ancestors.""" # The api can be implemented as a trivial overlay on top of # iter_entries, it is not an efficient implementation, but it at least # gets the job done. found_keys = set() search_keys = set() for _index, key, _value, refs in self.iter_entries(keys): parent_keys = refs[ref_list_num] found_keys.add(key) parent_map[key] = parent_keys search_keys.update(parent_keys) # Figure out what, if anything, was missing missing_keys.update(set(keys).difference(found_keys)) search_keys = search_keys.difference(parent_map) return search_keys def key_count(self): """Return an estimate of the number of keys in this index. For GraphIndex the estimate is exact. """ if self._key_count is None: self._read_and_parse([_HEADER_READV]) return self._key_count def _lookup_keys_via_location(self, location_keys): """Public interface for implementing bisection. If _buffer_all has been called, then all the data for the index is in memory, and this method should not be called, as it uses a separate cache because it cannot pre-resolve all indices, which buffer_all does for performance. :param location_keys: A list of location(byte offset), key tuples. :return: A list of (location_key, result) tuples as expected by bzrformats.bisect_multi.bisect_multi_bytes. 
""" # Possible improvements: # - only bisect lookup each key once # - sort the keys first, and use that to reduce the bisection window # ----- # this progresses in three parts: # read data # parse it # attempt to answer the question from the now in memory data. # build the readv request # for each location, ask for 800 bytes - much more than rows we've seen # anywhere. readv_ranges = [] for location, key in location_keys: # can we answer from cache? if self._bisect_nodes and key in self._bisect_nodes: # We have the key parsed. continue index = self._parsed_key_index(key) if ( len(self._parsed_key_map) and self._parsed_key_map[index][0] <= key and ( self._parsed_key_map[index][1] >= key or # end of the file has been parsed self._parsed_byte_map[index][1] == self._size ) ): # the key has been parsed, so no lookup is needed even if its # not present. continue # - if we have examined this part of the file already - yes index = self._parsed_byte_index(location) if ( len(self._parsed_byte_map) and self._parsed_byte_map[index][0] <= location and self._parsed_byte_map[index][1] > location ): # the byte region has been parsed, so no read is needed. continue length = 800 if location + length > self._size: length = self._size - location # todo, trim out parsed locations. 
if length > 0: readv_ranges.append((location, length)) # read the header if needed if self._bisect_nodes is None: readv_ranges.append(_HEADER_READV) self._read_and_parse(readv_ranges) result = [] if self._nodes is not None: # _read_and_parse triggered a _buffer_all because we requested the # whole data range for location, key in location_keys: if key not in self._nodes: # not present result.append(((location, key), False)) elif self.node_ref_lists: value, refs = self._nodes[key] result.append(((location, key), (self, key, value, refs))) else: result.append(((location, key), (self, key, self._nodes[key]))) return result # generate results: # - figure out <, >, missing, present # - result present references so we can return them. # keys that we cannot answer until we resolve references pending_references = [] pending_locations = set() for location, key in location_keys: # can we answer from cache? if key in self._bisect_nodes: # the key has been parsed, so no lookup is needed if self.node_ref_lists: # the references may not have been all parsed. value, refs = self._bisect_nodes[key] wanted_locations = [] for ref_list in refs: for ref in ref_list: if ref not in self._keys_by_offset: wanted_locations.append(ref) if wanted_locations: pending_locations.update(wanted_locations) pending_references.append((location, key)) continue result.append( ( (location, key), (self, key, value, self._resolve_references(refs)), ) ) else: result.append( ((location, key), (self, key, self._bisect_nodes[key])) ) continue else: # has the region the key should be in, been parsed? 
index = self._parsed_key_index(key) if self._parsed_key_map[index][0] <= key and ( self._parsed_key_map[index][1] >= key or # end of the file has been parsed self._parsed_byte_map[index][1] == self._size ): result.append(((location, key), False)) continue # no, is the key above or below the probed location: # get the range of the probed & parsed location index = self._parsed_byte_index(location) # if the key is below the start of the range, it's below direction = -1 if key < self._parsed_key_map[index][0] else +1 result.append(((location, key), direction)) readv_ranges = [] # lookup data to resolve references for location in pending_locations: length = 800 if location + length > self._size: length = self._size - location # TODO: trim out parsed locations (e.g. if the 800 is into the # parsed region trim it, and don't use the adjust_for_latency # facility) if length > 0: readv_ranges.append((location, length)) self._read_and_parse(readv_ranges) if self._nodes is not None: # The _read_and_parse triggered a _buffer_all, grab the data and # return it for location, key in pending_references: value, refs = self._nodes[key] result.append(((location, key), (self, key, value, refs))) return result for location, key in pending_references: # answer key references we had to look-up-late. value, refs = self._bisect_nodes[key] result.append( ((location, key), (self, key, value, self._resolve_references(refs))) ) return result def _parse_header_from_bytes(self, bytes): """Parse the header from a region of bytes. :param bytes: The data to parse. :return: An offset, data tuple such as readv yields, for the unparsed data. (which may have length 0).
""" signature = bytes[0 : len(self._signature())] if not signature == self._signature(): raise BadIndexFormatSignature(self._name, GraphIndex) lines = bytes[len(self._signature()) :].splitlines() options_line = lines[0] if not options_line.startswith(_OPTION_NODE_REFS): raise BadIndexOptions(self) try: self.node_ref_lists = int(options_line[len(_OPTION_NODE_REFS) :]) except ValueError as e: raise BadIndexOptions(self) from e options_line = lines[1] if not options_line.startswith(_OPTION_KEY_ELEMENTS): raise BadIndexOptions(self) try: self._key_length = int(options_line[len(_OPTION_KEY_ELEMENTS) :]) except ValueError as e: raise BadIndexOptions(self) from e options_line = lines[2] if not options_line.startswith(_OPTION_LEN): raise BadIndexOptions(self) try: self._key_count = int(options_line[len(_OPTION_LEN) :]) except ValueError as e: raise BadIndexOptions(self) from e # calculate the bytes we have processed header_end = len(signature) + len(lines[0]) + len(lines[1]) + len(lines[2]) + 3 self._parsed_bytes(0, (), header_end, ()) # setup parsing state self._expected_elements = 3 + self._key_length # raw data keyed by offset self._keys_by_offset = {} # keys with the value and node references self._bisect_nodes = {} return header_end, bytes[header_end:] def _parse_region(self, offset, data): """Parse node data returned from a readv operation. :param offset: The byte offset the data starts at. :param data: The data to parse. """ # trim the data. # end first: end = offset + len(data) high_parsed = offset while True: # Trivial test - if the current index's end is within the # low-matching parsed range, we're done. index = self._parsed_byte_index(high_parsed) if end < self._parsed_byte_map[index][1]: return # print "[%d:%d]" % (offset, end), \ # self._parsed_byte_map[index:index + 2] high_parsed, last_segment = self._parse_segment(offset, data, end, index) if last_segment: return def _parse_segment(self, offset, data, end, index): """Parse one segment of data. 
:param offset: Where 'data' begins in the file. :param data: Some data to parse a segment of. :param end: Where data ends. :param index: The current index into the parsed bytes map. :return: high_parsed_byte, last_segment. high_parsed_byte is the location of the highest parsed byte in this segment, last_segment is True if the parsed segment is the last possible one in the data block. """ # default is to use all data trim_end = None # accommodate overlap with data before this. if offset < self._parsed_byte_map[index][1]: # overlaps the lower parsed region # skip the parsed data trim_start = self._parsed_byte_map[index][1] - offset # don't trim the start for \n start_adjacent = True elif offset == self._parsed_byte_map[index][1]: # abuts the lower parsed region # use all data trim_start = None # do not trim anything start_adjacent = True else: # does not overlap the lower parsed region # use all data trim_start = None # but trim the leading \n start_adjacent = False if end == self._size: # lines up to the end of all data: # use it all trim_end = None # do not strip to the last \n end_adjacent = True last_segment = True elif index + 1 == len(self._parsed_byte_map): # at the end of the parsed data # use it all trim_end = None # but strip to the last \n end_adjacent = False last_segment = True elif end == self._parsed_byte_map[index + 1][0]: # abuts the next parsed region # use it all trim_end = None # do not strip to the last \n end_adjacent = True last_segment = True elif end > self._parsed_byte_map[index + 1][0]: # overlaps into the next parsed region # only consider the unparsed data trim_end = self._parsed_byte_map[index + 1][0] - offset # do not strip to the last \n as we know it's an entire record end_adjacent = True last_segment = end < self._parsed_byte_map[index + 1][1] else: # does not overlap into the next region # use it all trim_end = None # but strip to the last \n
end_adjacent = False last_segment = True # now find bytes to discard if needed if not start_adjacent: # work around python bug in rfind if trim_start is None: trim_start = data.find(b"\n") + 1 else: trim_start = data.find(b"\n", trim_start) + 1 if not (trim_start != 0): raise AssertionError("no \n was present") # print 'removing start', offset, trim_start, repr(data[:trim_start]) if not end_adjacent: # work around python bug in rfind if trim_end is None: trim_end = data.rfind(b"\n") + 1 else: trim_end = data.rfind(b"\n", None, trim_end) + 1 if not (trim_end != 0): raise AssertionError("no \n was present") # print 'removing end', offset, trim_end, repr(data[trim_end:]) # adjust offset and data to the parseable data. trimmed_data = data[trim_start:trim_end] if not (trimmed_data): raise AssertionError( "read unneeded data [%d:%d] from [%d:%d]" % (trim_start, trim_end, offset, offset + len(data)) ) if trim_start: offset += trim_start # print "parsing", repr(trimmed_data) # splitlines mangles the \r delimiters.. don't use it. lines = trimmed_data.split(b"\n") del lines[-1] pos = offset first_key, last_key, nodes, _ = self._parse_lines(lines, pos) for key, value in nodes: self._bisect_nodes[key] = value self._parsed_bytes(offset, first_key, offset + len(trimmed_data), last_key) return offset + len(trimmed_data), last_segment def _parse_lines(self, lines, pos): key = None first_key = None trailers = 0 nodes = [] for line in lines: if line == b"": # must be at the end if self._size and not (self._size == pos + 1): raise AssertionError(f"{self._size} {pos}") trailers += 1 continue elements = line.split(b"\0") if len(elements) != self._expected_elements: raise BadIndexData(self) # keys are tuples. Each element is a string that may occur many # times, so we intern them to save space. 
AB, RC, 200807 key = tuple(elements[: self._key_length]) if first_key is None: first_key = key absent, references, value = elements[-3:] ref_lists = [] for ref_string in references.split(b"\t"): ref_lists.append( tuple([int(ref) for ref in ref_string.split(b"\r") if ref]) ) ref_lists = tuple(ref_lists) self._keys_by_offset[pos] = (key, absent, ref_lists, value) pos += len(line) + 1 # +1 for the \n if absent: continue node_value = (value, ref_lists) if self.node_ref_lists else value nodes.append((key, node_value)) # print "parsed ", key return first_key, key, nodes, trailers def _parsed_bytes(self, start, start_key, end, end_key): """Mark the bytes from start to end as parsed. Calling self._parsed_bytes(1,2) will mark one byte (the one at offset 1) as parsed. :param start: The start of the parsed region. :param end: The end of the parsed region. """ index = self._parsed_byte_index(start) new_value = (start, end) new_key = (start_key, end_key) if index == -1: # first range parsed is always the beginning. 
self._parsed_byte_map.insert(index, new_value) self._parsed_key_map.insert(index, new_key) return # four cases: # new region # extend lower region # extend higher region # combine two regions if ( index + 1 < len(self._parsed_byte_map) and self._parsed_byte_map[index][1] == start and self._parsed_byte_map[index + 1][0] == end ): # combine two regions self._parsed_byte_map[index] = ( self._parsed_byte_map[index][0], self._parsed_byte_map[index + 1][1], ) self._parsed_key_map[index] = ( self._parsed_key_map[index][0], self._parsed_key_map[index + 1][1], ) del self._parsed_byte_map[index + 1] del self._parsed_key_map[index + 1] elif self._parsed_byte_map[index][1] == start: # extend the lower entry self._parsed_byte_map[index] = (self._parsed_byte_map[index][0], end) self._parsed_key_map[index] = (self._parsed_key_map[index][0], end_key) elif ( index + 1 < len(self._parsed_byte_map) and self._parsed_byte_map[index + 1][0] == end ): # extend the higher entry self._parsed_byte_map[index + 1] = ( start, self._parsed_byte_map[index + 1][1], ) self._parsed_key_map[index + 1] = ( start_key, self._parsed_key_map[index + 1][1], ) else: # new entry self._parsed_byte_map.insert(index + 1, new_value) self._parsed_key_map.insert(index + 1, new_key) def _read_and_parse(self, readv_ranges): """Read the ranges and parse the resulting data. :param readv_ranges: A prepared readv range list. 
""" if not readv_ranges: return if self._nodes is None and self._bytes_read * 2 >= self._size: # We've already read more than 50% of the file and we are about to # request more data, just _buffer_all() and be done self._buffer_all() return base_offset = self._base_offset if base_offset != 0: # Rewrite the ranges for the offset readv_ranges = [(start + base_offset, size) for start, size in readv_ranges] readv_data = self._transport.readv( self._name, readv_ranges, True, self._size + self._base_offset ) # parse for offset, data in readv_data: offset -= base_offset self._bytes_read += len(data) if offset < 0: # transport.readv() expanded to extra data which isn't part of # this index data = data[-offset:] offset = 0 if offset == 0 and len(data) == self._size: # We read the whole range, most likely because the # Transport upcast our readv ranges into one long request # for enough total data to grab the whole index. self._buffer_all(BytesIO(data)) return if self._bisect_nodes is None: # this must be the start if not (offset == 0): raise AssertionError() offset, data = self._parse_header_from_bytes(data) # print readv_ranges, "[%d:%d]" % (offset, offset + len(data)) self._parse_region(offset, data) def _signature(self): """The file signature for this index type.""" return _SIGNATURE def validate(self): """Validate that everything in the index can be accessed.""" # iter_all validates completely at the moment, so just do that. for _node in self.iter_all_entries(): pass class CombinedGraphIndex: """A GraphIndex made up from smaller GraphIndices. The backing indices must implement GraphIndex, and are presumed to be static data. Queries against the combined index will be made against the first index, and then the second and so on. The order of indices can thus influence performance significantly. For example, if one index is on local disk and a second on a remote server, the local disk index should be before the other in the index list. 
Also, queries tend to need results from the same indices as previous queries. So the indices will be reordered after every query to put the indices that had the result(s) of that query first (while otherwise preserving the relative ordering). """ def __init__(self, indices, reload_func=None): """Create a CombinedGraphIndex backed by indices. :param indices: An ordered list of indices to query for data. :param reload_func: A function to call if we find we are missing an index. Should have the form reload_func() => True/False to indicate if reloading actually changed anything. """ self._indices = indices self._reload_func = reload_func # Sibling indices are other CombinedGraphIndex that we should call # _move_to_front_by_name on when we auto-reorder ourself. self._sibling_indices = [] # A list of names that corresponds to the instances in self._indices, # so _index_names[0] is always the name for _indices[0], etc. Sibling # indices must all use the same set of names as each other. self._index_names = [None] * len(self._indices) def __repr__(self): """Return string representation of the combined index.""" return f"{self.__class__.__name__}({', '.join(map(repr, self._indices))})" def clear_cache(self): """See GraphIndex.clear_cache().""" for index in self._indices: index.clear_cache() def get_parent_map(self, keys): """See graph.StackedParentsProvider.get_parent_map.""" search_keys = set(keys) if _mod_revision.NULL_REVISION in search_keys: search_keys.discard(_mod_revision.NULL_REVISION) found_parents = {_mod_revision.NULL_REVISION: []} else: found_parents = {} for _index, key, _value, refs in self.iter_entries(search_keys): parents = refs[0] if not parents: parents = (_mod_revision.NULL_REVISION,) found_parents[key] = parents return found_parents __contains__ = _has_key_from_parent_map def insert_index(self, pos, index, name=None): """Insert a new index in the list of indices to query. :param pos: The position to insert the index. :param index: The index to insert. 
        :param name: a name for this index, e.g. a pack name. These names can
            be used to reflect index reorderings to related CombinedGraphIndex
            instances that use the same names. (see set_sibling_indices)
        """
        self._indices.insert(pos, index)
        self._index_names.insert(pos, name)

    def iter_all_entries(self):
        """Iterate over all keys within the index.

        Duplicate keys across child indices are presumed to have the same
        value and are only reported once.

        :return: An iterable of (index, key, value, reference_lists).
            There is no defined order for the result iteration - it will be in
            the most efficient order for the index.
        """
        seen_keys = set()
        while True:
            try:
                for index in self._indices:
                    for node in index.iter_all_entries():
                        if node[1] not in seen_keys:
                            yield node
                            seen_keys.add(node[1])
                return
            except TransportNoSuchFile as e:
                if not self._try_reload(e):
                    raise

    def iter_entries(self, keys):
        """Iterate over keys within the index.

        Duplicate keys across child indices are presumed to have the same
        value and are only reported once.

        :param keys: An iterable providing the keys to be retrieved.
        :return: An iterable of (index, key, value, reference_lists). There is
            no defined order for the result iteration - it will be in the most
            efficient order for the index.
        """
        keys = set(keys)
        hit_indices = []
        while True:
            try:
                for index in self._indices:
                    if not keys:
                        break
                    index_hit = False
                    for node in index.iter_entries(keys):
                        keys.remove(node[1])
                        yield node
                        index_hit = True
                    if index_hit:
                        hit_indices.append(index)
                break
            except TransportNoSuchFile as e:
                if not self._try_reload(e):
                    raise
        self._move_to_front(hit_indices)

    def iter_entries_prefix(self, keys):
        """Iterate over keys within the index using prefix matching.

        Duplicate keys across child indices are presumed to have the same
        value and are only reported once.

        Prefix matching is applied within the tuple of a key, not within
        the bytestring of each key element. e.g.
if you have the keys ('foo', 'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then only the former key is returned. :param keys: An iterable providing the key prefixes to be retrieved. Each key prefix takes the form of a tuple the length of a key, but with the last N elements 'None' rather than a regular bytestring. The first element cannot be 'None'. :return: An iterable as per iter_all_entries, but restricted to the keys with a matching prefix to those supplied. No additional keys will be returned, and every match that is in the index will be returned. """ keys = set(keys) if not keys: return seen_keys = set() hit_indices = [] while True: try: for index in self._indices: index_hit = False for node in index.iter_entries_prefix(keys): if node[1] in seen_keys: continue seen_keys.add(node[1]) yield node index_hit = True if index_hit: hit_indices.append(index) break except TransportNoSuchFile as e: if not self._try_reload(e): raise self._move_to_front(hit_indices) def _move_to_front(self, hit_indices): """Rearrange self._indices so that hit_indices are first. Order is maintained as much as possible, e.g. the first unhit index will be the first index in _indices after the hit_indices, and the hit_indices will be present in exactly the order they are passed to _move_to_front. _move_to_front propagates to all objects in self._sibling_indices by calling _move_to_front_by_name. """ if self._indices[: len(hit_indices)] == hit_indices: # The 'hit_indices' are already at the front (and in the same # order), no need to re-order return hit_names = self._move_to_front_by_index(hit_indices) for sibling_idx in self._sibling_indices: sibling_idx._move_to_front_by_name(hit_names) def _move_to_front_by_index(self, hit_indices): """Core logic for _move_to_front. Returns a list of names corresponding to the hit_indices param. 
""" indices_info = zip(self._index_names, self._indices, strict=False) if logger.isEnabledFor(logging.DEBUG): indices_info = list(indices_info) logger.debug( "CombinedGraphIndex reordering: currently %r, promoting %r", indices_info, hit_indices, ) hit_names = [] unhit_names = [] new_hit_indices = [] unhit_indices = [] for offset, (name, idx) in enumerate(indices_info): if idx in hit_indices: hit_names.append(name) new_hit_indices.append(idx) if len(new_hit_indices) == len(hit_indices): # We've found all of the hit entries, everything else is # unhit unhit_names.extend(self._index_names[offset + 1 :]) unhit_indices.extend(self._indices[offset + 1 :]) break else: unhit_names.append(name) unhit_indices.append(idx) self._indices = new_hit_indices + unhit_indices self._index_names = hit_names + unhit_names if logger.isEnabledFor(logging.DEBUG): logger.debug("CombinedGraphIndex reordered: %r", self._indices) return hit_names def _move_to_front_by_name(self, hit_names): """Moves indices named by 'hit_names' to front of the search order, as described in _move_to_front. """ # Translate names to index instances, and then call # _move_to_front_by_index. indices_info = zip(self._index_names, self._indices, strict=False) hit_indices = [] for name, idx in indices_info: if name in hit_names: hit_indices.append(idx) self._move_to_front_by_index(hit_indices) def find_ancestry(self, keys, ref_list_num): """Find the complete ancestry for the given set of keys. Note that this is a whole-ancestry request, so it should be used sparingly. :param keys: An iterable of keys to look for :param ref_list_num: The reference list which references the parents we care about. :return: (parent_map, missing_keys) """ # XXX: make this call _move_to_front? 
missing_keys = set() parent_map = {} keys_to_lookup = set(keys) generation = 0 while keys_to_lookup: # keys that *all* indexes claim are missing, stop searching them generation += 1 all_index_missing = None # print 'gen\tidx\tsub\tn_keys\tn_pmap\tn_miss' # print '%4d\t\t\t%4d\t%5d\t%5d' % (generation, len(keys_to_lookup), # len(parent_map), # len(missing_keys)) for _index_idx, index in enumerate(self._indices): # TODO: we should probably be doing something with # 'missing_keys' since we've already determined that # those revisions have not been found anywhere index_missing_keys = set() # Find all of the ancestry we can from this index # keep looking until the search_keys set is empty, which means # things we didn't find should be in index_missing_keys search_keys = keys_to_lookup sub_generation = 0 # print ' \t%2d\t\t%4d\t%5d\t%5d' % ( # index_idx, len(search_keys), # len(parent_map), len(index_missing_keys)) while search_keys: sub_generation += 1 # TODO: ref_list_num should really be a parameter, since # CombinedGraphIndex does not know what the ref lists # mean. search_keys = index._find_ancestors( search_keys, ref_list_num, parent_map, index_missing_keys ) # print ' \t \t%2d\t%4d\t%5d\t%5d' % ( # sub_generation, len(search_keys), # len(parent_map), len(index_missing_keys)) # Now set whatever was missing to be searched in the next index keys_to_lookup = index_missing_keys if all_index_missing is None: all_index_missing = set(index_missing_keys) else: all_index_missing.intersection_update(index_missing_keys) if not keys_to_lookup: break if all_index_missing is None: # There were no indexes, so all search keys are 'missing' missing_keys.update(keys_to_lookup) keys_to_lookup = None else: missing_keys.update(all_index_missing) keys_to_lookup.difference_update(all_index_missing) return parent_map, missing_keys def key_count(self): """Return an estimate of the number of keys in this index. 
        For CombinedGraphIndex this is approximated by the sum of the keys of
        the child indices. As child indices may have duplicate keys this can
        have a maximum error of the number of child indices * largest number of
        keys in any index.
        """
        while True:
            try:
                return sum((index.key_count() for index in self._indices), 0)
            except TransportNoSuchFile as e:
                if not self._try_reload(e):
                    raise

    missing_keys = _missing_keys_from_parent_map

    def _try_reload(self, error):
        """We just got a NoSuchFile exception.

        Try to reload the indices. Returns True if the reload changed
        something, so the operation is worth retrying. Returns False if there
        is no reload function or nothing changed, in which case the caller
        re-raises the current exception.
        """
        if self._reload_func is None:
            return False
        logger.debug("Trying to reload after getting exception: %s", str(error))
        if not self._reload_func():
            # We tried to reload, but nothing changed, so we fail anyway
            logger.debug(
                "_reload_func indicated nothing has changed."
                " Raising original exception."
            )
            return False
        return True

    def set_sibling_indices(self, sibling_combined_graph_indices):
        """Set the CombinedGraphIndex objects to reorder after reordering self."""
        self._sibling_indices = sibling_combined_graph_indices

    def validate(self):
        """Validate that everything in the index can be accessed."""
        while True:
            try:
                for index in self._indices:
                    index.validate()
                return
            except TransportNoSuchFile as e:
                if not self._try_reload(e):
                    raise


class InMemoryGraphIndex(GraphIndexBuilder):
    """A GraphIndex which operates entirely out of memory and is mutable.

    This is designed to allow the accumulation of GraphIndex entries during a
    single write operation, where the accumulated entries need to be immediately
    available - for example via a CombinedGraphIndex.
    """

    def add_nodes(self, nodes):
        """Add nodes to the index.

        :param nodes: An iterable of (key, value, node_refs) entries to add,
            or (key, value) entries if the index has no reference lists.
        """
        if self.reference_lists:
            for key, value, node_refs in nodes:
                self.add_node(key, value, node_refs)
        else:
            for key, value in nodes:
                self.add_node(key, value)

    def iter_all_entries(self):
        """Iterate over all keys within the index.

        :return: An iterable of (index, key, value, reference_lists), or
            (index, key, value) if the index has no reference lists. There is
            no defined order for the result iteration - it will be in the most
            efficient order for the index (in this case dictionary iteration
            order).
        """
        evil_logger.debug("iter_all_entries scales with size of history.")
        if self.reference_lists:
            for key, (absent, references, value) in self._nodes.items():
                if not absent:
                    yield self, key, value, references
        else:
            for key, (absent, _references, value) in self._nodes.items():
                if not absent:
                    yield self, key, value

    def iter_entries(self, keys):
        """Iterate over keys within the index.

        :param keys: An iterable providing the keys to be retrieved.
        :return: An iterable of (index, key, value, reference_lists). There is no
            defined order for the result iteration - it will be in the most
            efficient order for the index (keys iteration order in this case).
        """
        # Note: See BTreeBuilder.iter_entries for an explanation of why we
        #       aren't using set().intersection() here
        nodes = self._nodes
        keys = [key for key in keys if key in nodes]
        if self.reference_lists:
            for key in keys:
                node = nodes[key]
                if not node[0]:
                    yield self, key, node[2], node[1]
        else:
            for key in keys:
                node = nodes[key]
                if not node[0]:
                    yield self, key, node[2]

    def iter_entries_prefix(self, keys):
        """Iterate over keys within the index using prefix matching.

        Prefix matching is applied within the tuple of a key, not within
        the bytestring of each key element. e.g. if you have the keys ('foo',
        'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then
        only the former key is returned.

        :param keys: An iterable providing the key prefixes to be retrieved.
            Each key prefix takes the form of a tuple the length of a key, but
            with the last N elements 'None' rather than a regular bytestring.
            The first element cannot be 'None'.
        :return: An iterable as per iter_all_entries, but restricted to the
            keys with a matching prefix to those supplied. No additional keys
            will be returned, and every match that is in the index will be
            returned.
        """
        keys = set(keys)
        if not keys:
            return
        if self._key_length == 1:
            for key in keys:
                _sanity_check_key(self, key)
                node = self._nodes[key]
                if node[0]:
                    continue
                if self.reference_lists:
                    yield self, key, node[2], node[1]
                else:
                    yield self, key, node[2]
            return
        nodes_by_key = self._get_nodes_by_key()
        yield from _iter_entries_prefix(self, nodes_by_key, keys)

    def key_count(self):
        """Return an estimate of the number of keys in this index.

        For InMemoryGraphIndex the estimate is exact.
        """
        return len(self._nodes) - len(self._absent_keys)

    def validate(self):
        """In-memory indexes have no known corruption at the moment."""

    def __lt__(self, other):
        """Return True if self < other for ordering purposes."""
        # We don't really care about the order, just that there is an order.
        if not isinstance(other, GraphIndex) and not isinstance(
            other, InMemoryGraphIndex
        ):
            raise TypeError(other)
        return hash(self) < hash(other)


class GraphIndexPrefixAdapter:
    """An adapter between GraphIndex with different key lengths.

    Queries against this will emit queries against the adapted Graph with the
    prefix added, queries for all items use iter_entries_prefix. The returned
    nodes will have their keys and node references adjusted to remove the
    prefix. Finally, an add_nodes_callback can be supplied - when called the
    nodes and references being added will have prefix prepended.
    """

    def __init__(self, adapted, prefix, missing_key_length, add_nodes_callback=None):
        """Construct an adapter against adapted with prefix."""
        self.adapted = adapted
        self.prefix_key = prefix + (None,) * missing_key_length
        self.prefix = prefix
        self.prefix_len = len(prefix)
        self.add_nodes_callback = add_nodes_callback

    def add_nodes(self, nodes):
        """Add nodes to the index.

        :param nodes: An iterable of (key, value, node_refs) entries to add,
            or (key, value) entries if no reference lists are in use.
        """
        # save nodes in case it's an iterator
        nodes = tuple(nodes)
        translated_nodes = []
        try:
            # Add the prefix to each reference. node_refs is a tuple of tuples,
            # so split it apart, and add the prefix to each internal reference
            for key, value, node_refs in nodes:
                adjusted_references = tuple(
                    tuple(self.prefix + ref_node for ref_node in ref_list)
                    for ref_list in node_refs
                )
                translated_nodes.append((self.prefix + key, value, adjusted_references))
        except ValueError:
            # XXX: TODO add an explicit interface for getting the reference list
            # status, to handle this bit of user-friendliness in the API more
            # explicitly.
            for key, value in nodes:
                translated_nodes.append((self.prefix + key, value))
        self.add_nodes_callback(translated_nodes)

    def add_node(self, key, value, references=()):
        r"""Add a node to the index.

        :param key: The key. keys are non-empty tuples containing
            as many whitespace-free utf8 bytestrings as the key length
            defined for this index.
        :param references: An iterable of iterables of keys. Each is a
            reference to another key.
        :param value: The value to associate with the key. It may be any
            bytes as long as it does not contain \0 or \n.
        """
        self.add_nodes(((key, value, references),))

    def _strip_prefix(self, an_iter):
        """Strip prefix data from nodes and return it."""
        for node in an_iter:
            # cross checks
            if node[1][: self.prefix_len] != self.prefix:
                raise BadIndexData(self)
            for ref_list in node[3]:
                for ref_node in ref_list:
                    if ref_node[: self.prefix_len] != self.prefix:
                        raise BadIndexData(self)
            yield (
                node[0],
                node[1][self.prefix_len :],
                node[2],
                (
                    tuple(
                        tuple(ref_node[self.prefix_len :] for ref_node in ref_list)
                        for ref_list in node[3]
                    )
                ),
            )

    def iter_all_entries(self):
        """Iterate over all keys within the index.

        iter_all_entries is implemented against the adapted index using
        iter_entries_prefix.

        :return: An iterable of (index, key, value, reference_lists).
            There is no defined order for the result iteration - it will be in
            the most efficient order for the index (in this case the order of
            the adapted index).
        """
        return self._strip_prefix(self.adapted.iter_entries_prefix([self.prefix_key]))

    def iter_entries(self, keys):
        """Iterate over keys within the index.

        :param keys: An iterable providing the keys to be retrieved.
        :return: An iterable of (index, key, value, reference_lists). There is no
            defined order for the result iteration - it will be in the most
            efficient order for the index (keys iteration order in this case).
        """
        return self._strip_prefix(
            self.adapted.iter_entries(self.prefix + key for key in keys)
        )

    def iter_entries_prefix(self, keys):
        """Iterate over keys within the index using prefix matching.

        Prefix matching is applied within the tuple of a key, not within
        the bytestring of each key element. e.g. if you have the keys ('foo',
        'bar'), ('foobar', 'gam') and do a prefix search for ('foo', None) then
        only the former key is returned.

        :param keys: An iterable providing the key prefixes to be retrieved.
            Each key prefix takes the form of a tuple the length of a key, but
            with the last N elements 'None' rather than a regular bytestring.
            The first element cannot be 'None'.
        :return: An iterable as per iter_all_entries, but restricted to the
            keys with a matching prefix to those supplied. No additional keys
            will be returned, and every match that is in the index will be
            returned.
        """
        return self._strip_prefix(
            self.adapted.iter_entries_prefix(self.prefix + key for key in keys)
        )

    def key_count(self):
        """Return an estimate of the number of keys in this index.

        For GraphIndexPrefixAdapter this is relatively expensive - key
        iteration with the prefix is done.
        """
        return len(list(self.iter_all_entries()))

    def validate(self):
        """Call the adapted's validate."""
        self.adapted.validate()


def _sanity_check_key(index_or_builder, key):
    """Raise BadIndexKey if key cannot be used for prefix matching."""
    if key[0] is None:
        raise BadIndexKey(key)
    if len(key) != index_or_builder._key_length:
        raise BadIndexKey(key)


def _iter_entries_prefix(index_or_builder, nodes_by_key, keys):
    """Helper for implementing prefix matching iterators."""
    for key in keys:
        _sanity_check_key(index_or_builder, key)
        # find what it refers to:
        key_dict = nodes_by_key
        elements = list(key)
        # find the subdict whose contents should be returned.
        try:
            while len(elements) and elements[0] is not None:
                key_dict = key_dict[elements[0]]
                elements.pop(0)
        except KeyError:
            # a non-existent lookup.
            continue
        if len(elements):
            dicts = [key_dict]
            while dicts:
                values_view = dicts.pop().values()
                # can't be empty or would not exist
                value = next(iter(values_view))
                if isinstance(value, dict):
                    # still descending, push values
                    dicts.extend(values_view)
                else:
                    # at leaf tuples, yield values
                    for value in values_view:
                        # each value is the key:value:node refs tuple
                        # ready to yield.
                        yield (index_or_builder,) + value
        else:
            # the last thing looked up was a terminal element
            yield (index_or_builder,) + key_dict
bzrformats_3.4.0.orig/bzrformats/inventory.py0000644000000000000000000014505615162115103016464 0ustar00
# Copyright (C) 2005-2011 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Inventory management for Bazaar. This module provides classes and functions for managing file inventories in Bazaar repositories, including entries for files, directories, symlinks, and tree references. """ # FIXME: This refactoring of the workingtree code doesn't seem to keep # the WorkingTree's copy of the inventory in sync with the branch. The # branch modifies its working inventory when it does a commit to make # missing files permanently removed. # TODO: Maybe also keep the full path of the entry, and the children? # But those depend on its position within a particular inventory, and # it would be nice not to need to hold the backpointer here. __all__ = [ "ROOT_ID", "CHKInventory", "FileId", "Inventory", "InventoryDirectory", "InventoryEntry", "InventoryFile", "InventoryLink", "TreeReference", ] import contextlib from collections import deque from collections.abc import Iterable from . import osutils from ._bzr_rs import ROOT_ID from ._bzr_rs import inventory as _mod_inventory_rs from .errors import BadFileKindError, BzrFormatsError, InconsistentDelta class NoSuchId(BzrFormatsError): """File ID not found in tree. Raised when a requested file ID is not present in the tree. """ _fmt = 'The file id "%(file_id)s" is not present in the tree %(tree)s.' 
    def __init__(self, tree, file_id):
        super().__init__()
        self.tree = tree
        self.file_id = file_id


FileId = bytes

InventoryEntry = _mod_inventory_rs.InventoryEntry
InventoryFile = _mod_inventory_rs.InventoryFile
InventoryDirectory = _mod_inventory_rs.InventoryDirectory
TreeReference = _mod_inventory_rs.TreeReference
InventoryLink = _mod_inventory_rs.InventoryLink
Inventory = _mod_inventory_rs.Inventory


class InvalidEntryName(BzrFormatsError):
    """An inventory entry name is not valid."""

    _fmt = "Invalid entry name: %(name)s"

    def __init__(self, name):
        super().__init__()
        self.name = name


class DuplicateFileId(BzrFormatsError):
    """A file id is already present in the inventory."""

    _fmt = "File id {%(file_id)s} already exists in inventory as %(entry)s"

    def __init__(self, file_id, entry):
        super().__init__()
        self.file_id = file_id
        self.entry = entry


class CHKInventory:
    """An inventory persisted in a CHK store.

    By design, a CHKInventory is immutable so many of the methods
    supported by Inventory - add, rename, apply_delta, etc - are *not*
    supported. To create a new CHKInventory, use, for example,
    create_by_apply_delta() or from_inventory().

    Internally, a CHKInventory has one or two CHKMaps:

    * id_to_entry - a map from (file_id,) => InventoryEntry as bytes
    * parent_id_basename_to_file_id - a map from (parent_id, basename_utf8)
      => file_id as bytes

    The second map is optional and not present in early CHK repositories.

    No caching is performed: every method call or item access will perform
    requests to the storage layer. As such, keep references to objects you
    want to reuse.
    """

    def has_filename(self, filename):
        """Check if a filename exists in the inventory.

        Args:
            filename: Path to check for existence.

        Returns:
            True if the filename exists, False otherwise.
        """
        return bool(self.path2id(filename))

    def id2path(self, file_id):
        """Return as a string the path to file_id.

        >>> i = Inventory()
        >>> e = i.add(InventoryDirectory(b'src-id', 'src', ROOT_ID))
        >>> e = i.add(InventoryFile(b'foo-id', 'foo.c', parent_id=b'src-id'))
        >>> print(i.id2path(b'foo-id'))
        src/foo.c

        :raises NoSuchId: If file_id is not present in the inventory.
        """
        # get all names, skipping root
        return "/".join(
            reversed(
                [parent.name for parent in self._iter_file_id_parents(file_id)][:-1]
            )
        )

    def iter_entries(self, from_dir=None, recursive=True):
        """Return (path, entry) pairs, in order by name.

        :param from_dir: if None, start from the root,
          otherwise start from this directory (either file-id or entry)
        :param recursive: recurse into directories or not
        """
        if from_dir is None:
            if self.root is None:
                return
            from_dir = self.root.file_id
            yield "", self.root
        elif not isinstance(from_dir, bytes):
            from_dir = from_dir.file_id

        # unrolling the recursive call changed the time from
        # 440ms/663ms (inline/total) to 116ms/116ms
        children = [(c.name, c) for c in self.iter_sorted_children(from_dir)]
        if not recursive:
            yield from children
            return
        children = deque(children)
        stack = [("", children)]
        while stack:
            from_dir_relpath, children = stack[-1]

            while children:
                name, ie = children.popleft()

                # we know that from_dir_relpath never ends in a slash
                # and 'name' doesn't begin with one, we can do a string op,
                # rather than the checks of pathjoin(), though this means that
                # all paths start with a slash
                path = from_dir_relpath + "/" + name

                yield path[1:], ie

                if ie.kind != "directory":
                    continue

                # But do this child first
                new_children = [
                    (c.name, c) for c in self.iter_sorted_children(ie.file_id)
                ]
                new_children = deque(new_children)
                stack.append((path, new_children))
                # Break out of inner loop, so that we start outer loop with child
                break
            else:
                # if we finished all children, pop it off the stack
                stack.pop()

    def iter_sorted_children(self, file_id):
        """Iterate through children of a directory in sorted order.

        Args:
            file_id: The file ID of the directory.

        Yields:
            Child inventory entries sorted by name.
""" return (c for (_n, c) in sorted(self.get_children(file_id).items())) def iter_entries_by_dir(self, from_dir=None, specific_file_ids=None): """Iterate over the entries in a directory first order. This returns all entries for a directory before returning the entries for children of a directory. This is not lexicographically sorted order, and is a hybrid between depth-first and breadth-first. :return: This yields (path, entry) pairs """ if specific_file_ids and not isinstance(specific_file_ids, set): specific_file_ids = set(specific_file_ids) # TODO? Perhaps this should return the from_dir so that the root is # yielded? or maybe an option? if from_dir is None and specific_file_ids is None: # They are iterating from the root, and have not specified any # specific entries to look at. All current callers fully consume the # iterator, so we can safely assume we are accessing all entries self._preload_cache() if from_dir is None: if self.root is None: return # Optimize a common case if specific_file_ids is not None and len(specific_file_ids) == 1: file_id = list(specific_file_ids)[0] if file_id is not None: try: path = self.id2path(file_id) except NoSuchId: pass else: yield path, self.get_entry(file_id) return from_dir = self.root if specific_file_ids is None or self.root.file_id in specific_file_ids: yield "", self.root elif isinstance(from_dir, bytes): from_dir = self.get_entry(from_dir) else: raise TypeError(from_dir) if specific_file_ids is not None: # TODO: jam 20070302 This could really be done as a loop rather # than a bunch of recursive calls. 
parents = set() byid = self def add_ancestors(file_id): if not byid.has_id(file_id): return parent_id = byid.get_entry(file_id).parent_id if parent_id is None: return if parent_id not in parents: parents.add(parent_id) add_ancestors(parent_id) for file_id in specific_file_ids: add_ancestors(file_id) else: parents = None stack = [("", from_dir)] while stack: cur_relpath, cur_dir = stack.pop() child_dirs = [] for child_ie in self.iter_sorted_children(cur_dir.file_id): child_relpath = cur_relpath + child_ie.name if specific_file_ids is None or child_ie.file_id in specific_file_ids: yield child_relpath, child_ie if child_ie.kind == "directory": if parents is None or child_ie.file_id in parents: child_dirs.append((child_relpath + "/", child_ie)) stack.extend(reversed(child_dirs)) def make_entry(self, kind, name, parent_id, file_id=None, revision=None, **kwargs): """Simple thunk to bzrformats.inventory.make_entry.""" return make_entry(kind, name, parent_id, file_id, revision, **kwargs) def entries(self): """Return list of (path, ie) for all entries except the root. This may be faster than iter_entries. """ accum = [] def descend(dir_ie, dir_path): for ie in self.iter_sorted_children(dir_ie.file_id): child_path = osutils.pathjoin(dir_path, ie.name) accum.append((child_path, ie)) if ie.kind == "directory": descend(ie, child_path) if self.root is not None: descend(self.root, "") return accum def get_entry_by_path_partial(self, relpath): """Like get_entry_by_path, but return TreeReference objects. :param relpath: Path to resolve, either as string with / as separators, or as list of elements. 
        :return: tuple with ie, resolved elements and elements left to resolve
        """
        names = osutils.splitpath(relpath) if isinstance(relpath, str) else relpath

        try:
            parent = self.root
        except NoSuchId:
            # root doesn't exist yet so nothing else can
            return None, None, None
        if parent is None:
            return None, None, None
        for i, f in enumerate(names):
            try:
                cie = self.get_child(parent.file_id, f)
                if cie is None:
                    return None, None, None
                if cie.kind == "tree-reference":
                    return cie, names[: i + 1], names[i + 1 :]
                parent = cie
            except KeyError:
                # or raise an error?
                return None, None, None
        return parent, names, []

    def get_entry_by_path(self, relpath):
        """Return an inventory entry by path.

        :param relpath: may be either a list of path components, or a single
            string, in which case it is automatically split.

        This returns the entry of the last component in the path,
        which may be either a file or a directory.

        Returns None IFF the path is not found.
        """
        names = osutils.splitpath(relpath) if isinstance(relpath, str) else relpath

        try:
            parent = self.root
        except NoSuchId:
            # root doesn't exist yet so nothing else can
            return None
        if parent is None:
            return None
        for f in names:
            try:
                cie = self.get_child(parent.file_id, f)
                if cie is None:
                    return None
                parent = cie
            except KeyError:
                # or raise an error?
                return None
        return parent

    def get_idpath(self, file_id):
        """Return a list of file_ids for the path to an entry.

        The list contains one element for each directory followed by
        the id of the file itself.  So the length of the returned list
        is equal to the depth of the file in the tree, counting the
        root directory as depth 1.
        """
        raise NotImplementedError(self.get_idpath)

    def __init__(self, search_key_name):
        """Initialize CHKInventory with a search key name.

        Args:
            search_key_name: Name of the search key for CHK operations.
        """
        self._fileid_to_entry_cache = {}
        self._fully_cached = False
        self._path_to_fileid_cache = {}
        self._search_key_name = search_key_name
        self.root_id = None
        self._children_cache = {}

    def __eq__(self, other):
        """Compare two inventories by comparing the root keys of their CHK maps."""
        if not isinstance(other, CHKInventory):
            return NotImplemented

        this_key = self.id_to_entry.key()
        other_key = other.id_to_entry.key()
        this_pid_key = self.parent_id_basename_to_file_id.key()
        other_pid_key = other.parent_id_basename_to_file_id.key()
        if None in (this_key, this_pid_key, other_key, other_pid_key):
            return False
        return this_key == other_key and this_pid_key == other_pid_key

    def get_children(self, dir_id):
        """Return the children of this directory as a dict mapping name to entry.

        With a parent_id_basename_to_file_id index, this loads all the children
        of the directory. Without one, the entire index would have to be
        loaded, which is bad, so such inventories are no longer supported. A
        more sophisticated proxy object might be nice, to allow partial loading
        of children as well when specific names are accessed. (So path
        traversal can be written in the obvious way but not examine siblings.)
        """
        children = self._children_cache.get(dir_id)
        if children is not None:
            return children
        # No longer supported
        if self.parent_id_basename_to_file_id is None:
            raise AssertionError(
                "Inventories without"
                " parent_id_basename_to_file_id are no longer supported"
            )
        result = {}
        # XXX: Todo - use proxy objects for the children rather than loading
        # all when the attribute is referenced.
        child_keys = set()
        for (
            _parent_id,
            _name_utf8,
        ), file_id in self.parent_id_basename_to_file_id.iteritems(
            key_filter=[(dir_id,)]
        ):
            child_keys.add((file_id,))
        cached = set()
        for file_id_key in child_keys:
            entry = self._fileid_to_entry_cache.get(file_id_key[0], None)
            if entry is not None:
                result[entry.name] = entry
                cached.add(file_id_key)
        child_keys.difference_update(cached)
        # populate; todo: do by name
        id_to_entry = self.id_to_entry
        for file_id_key, bytes in id_to_entry.iteritems(child_keys):
            entry = self._bytes_to_entry(bytes)
            result[entry.name] = entry
            self._fileid_to_entry_cache[file_id_key[0]] = entry
        self._children_cache[dir_id] = result
        return result

    def get_child(self, dir_id, name):
        """Get a specific child from a directory.

        Args:
            dir_id: The file ID of the directory.
            name: The name of the child to retrieve.

        Returns:
            The child inventory entry or None if not found.
        """
        # TODO(jelmer): Implement a version that doesn't load all children.
        return self.get_children(dir_id).get(name)

    def _expand_fileids_to_parents_and_children(self, file_ids):
        """Give a more holistic view starting with the given file_ids.

        For any file_id which maps to a directory, we will include all children
        of that directory. We will also include all directories which are
        parents of the given file_ids, but we will not include their children.

        eg:
          /     # TREE_ROOT
          foo/  # foo-id
            baz # baz-id
            frob/ # frob-id
              fringle # fringle-id
          bar/  # bar-id
            bing # bing-id

        if given [foo-id] we will include
            TREE_ROOT as interesting parents
        and
            foo-id, baz-id, frob-id, fringle-id
        As interesting ids.
        """
        interesting = set()
        # TODO: Pre-pass over the list of fileids to see if anything is already
        #       deserialized in self._fileid_to_entry_cache

        directories_to_expand = set()
        children_of_parent_id = {}
        # It is okay if some of the fileids are missing
        for entry in self._getitems(file_ids):
            if entry.kind == "directory":
                directories_to_expand.add(entry.file_id)
            interesting.add(entry.parent_id)
            children_of_parent_id.setdefault(entry.parent_id, set()).add(entry.file_id)

        # Now, interesting has all of the direct parents, but not the
        # parents of those parents. It also may have some duplicates with
        # specific_fileids
        remaining_parents = interesting.difference(file_ids)
        # When we hit the TREE_ROOT, we'll get an interesting parent of None,
        # but we don't actually want to recurse into that
        interesting.add(None)  # this will auto-filter it in the loop
        remaining_parents.discard(None)
        while remaining_parents:
            next_parents = set()
            for entry in self._getitems(remaining_parents):
                next_parents.add(entry.parent_id)
                children_of_parent_id.setdefault(entry.parent_id, set()).add(
                    entry.file_id
                )
            # Remove any search tips we've already processed
            remaining_parents = next_parents.difference(interesting)
            interesting.update(remaining_parents)
            # We should probably also .difference(directories_to_expand)
        interesting.update(file_ids)
        interesting.discard(None)
        while directories_to_expand:
            # Expand directories by looking in the
            # parent_id_basename_to_file_id map
            keys = [(f,) for f in directories_to_expand]
            directories_to_expand = set()
            items = self.parent_id_basename_to_file_id.iteritems(keys)
            next_file_ids = {item[1] for item in items}
            next_file_ids = next_file_ids.difference(interesting)
            interesting.update(next_file_ids)
            for entry in self._getitems(next_file_ids):
                if entry.kind == "directory":
                    directories_to_expand.add(entry.file_id)
                children_of_parent_id.setdefault(entry.parent_id, set()).add(
                    entry.file_id
                )
        return interesting, children_of_parent_id

    def filter(self, specific_fileids):
        """Get an
inventory view filtered against a set of file-ids.

        Children of directories and parents are included.

        The result may or may not reference the underlying inventory
        so it should be treated as immutable.
        """
        (
            interesting,
            parent_to_children,
        ) = self._expand_fileids_to_parents_and_children(specific_fileids)
        # There is some overlap here, but we assume that all interesting items
        # are in the _fileid_to_entry_cache because we had to read them to
        # determine if they were a dir we wanted to recurse, or just a file
        # This should give us all the entries we'll want to add, so start
        # adding
        other = Inventory(root_id=None)
        root = InventoryDirectory(self.root_id, "", None, self.root.revision)
        other.add(root)
        other.revision_id = self.revision_id
        if not interesting or not parent_to_children:
            # empty filter, or filtering entries that don't exist
            # (if even 1 existed, then we would have populated
            # parent_to_children with at least the tree root.)
            return other
        cache = self._fileid_to_entry_cache
        remaining_children = deque(parent_to_children[self.root_id])
        while remaining_children:
            file_id = remaining_children.popleft()
            ie = cache[file_id]
            if ie.kind == "directory":
                # We create a copy to depopulate the .children attribute
                ie = ie.copy()
            # TODO: depending on the uses of 'other' we should probably always
            #       '.copy()' to prevent someone from mutating other and
            #       invalidating our internal cache
            other.add(ie)
            if file_id in parent_to_children:
                remaining_children.extend(parent_to_children[file_id])
        return other

    def _bytes_to_entry(self, bytes):
        """Deserialise a serialised entry."""
        result = _chk_inventory_bytes_to_entry(bytes)
        self._fileid_to_entry_cache[result.file_id] = result
        return result

    def create_by_apply_delta(
        self, inventory_delta, new_revision_id, propagate_caches=False
    ):
        """Create a new CHKInventory by applying inventory_delta to this one.

        See the inventory developers documentation for the theory behind
        inventory deltas.

        :param inventory_delta: The inventory delta to apply.
            See Inventory.apply_delta for details.
        :param new_revision_id: The revision id of the resulting CHKInventory.
        :param propagate_caches: If True, the caches for this inventory are
            copied to and updated for the result.
        :return: The new CHKInventory.
        """
        split = osutils.split
        result = CHKInventory(self._search_key_name)
        if propagate_caches:
            # Just propagate the path-to-fileid cache for now
            result._path_to_fileid_cache = self._path_to_fileid_cache.copy()
        from . import chk_map

        search_key_func = chk_map.search_key_registry.get(self._search_key_name)
        self.id_to_entry._ensure_root()
        maximum_size = self.id_to_entry._root_node.maximum_size
        result.revision_id = new_revision_id
        result.id_to_entry = chk_map.CHKMap(
            self.id_to_entry._store,
            self.id_to_entry.key(),
            search_key_func=search_key_func,
        )
        result.id_to_entry._ensure_root()
        result.id_to_entry._root_node.set_maximum_size(maximum_size)
        # Change to apply to the parent_id_basename delta. The dict maps
        # (parent_id, basename) -> (old_key, new_value). We use a dict because
        # when a path has its id replaced (e.g. the root is changed, or someone
        # does bzr mv a b, bzr mv c a, we should output a single change to this
        # map rather than two.
        parent_id_basename_delta = {}
        if self.parent_id_basename_to_file_id is not None:
            result.parent_id_basename_to_file_id = chk_map.CHKMap(
                self.parent_id_basename_to_file_id._store,
                self.parent_id_basename_to_file_id.key(),
                search_key_func=search_key_func,
            )
            result.parent_id_basename_to_file_id._ensure_root()
            self.parent_id_basename_to_file_id._ensure_root()
            result_p_id_root = result.parent_id_basename_to_file_id._root_node
            p_id_root = self.parent_id_basename_to_file_id._root_node
            result_p_id_root.set_maximum_size(p_id_root.maximum_size)
            result_p_id_root._key_width = p_id_root._key_width
        else:
            result.parent_id_basename_to_file_id = None
        result.root_id = self.root_id
        id_to_entry_delta = []
        # inventory_delta is only traversed once, so we just update the
        # variable.
        inventory_delta.check()
        # All changed entries need to have their parents be directories and be
        # at the right path. This set contains (path, id) tuples.
        parents = set()
        # When we delete an item, all the children of it must be either deleted
        # or altered in their own right. As we batch process the change via
        # CHKMap.apply_delta, we build a set of things to use to validate the
        # delta.
        deletes = set()
        altered = set()
        for old_path, new_path, file_id, entry in inventory_delta:
            # file id changes
            if new_path == "":
                result.root_id = file_id
            if new_path is None:
                # Make a delete:
                new_key = None
                new_value = None
                # Update caches
                if propagate_caches:
                    with contextlib.suppress(KeyError):
                        del result._path_to_fileid_cache[old_path]
                deletes.add(file_id)
            else:
                new_key = (file_id,)
                new_value = _chk_inventory_entry_to_bytes(entry)
                # Update caches. It's worth doing this whether
                # we're propagating the old caches or not.
                result._path_to_fileid_cache[new_path] = file_id
                parents.add((split(new_path)[0], entry.parent_id))
            if old_path is None:
                old_key = None
            else:
                old_key = (file_id,)
                if self.id2path(file_id) != old_path:
                    raise InconsistentDelta(
                        old_path,
                        file_id,
                        "Entry was at wrong other path {!r}.".format(
                            self.id2path(file_id)
                        ),
                    )
                altered.add(file_id)
            id_to_entry_delta.append((old_key, new_key, new_value))
            if result.parent_id_basename_to_file_id is not None:
                # parent_id, basename changes
                if old_path is None:
                    old_key = None
                else:
                    old_entry = self.get_entry(file_id)
                    old_key = self._parent_id_basename_key(old_entry)
                if new_path is None:
                    new_key = None
                    new_value = None
                else:
                    new_key = self._parent_id_basename_key(entry)
                    new_value = file_id
                # If the two keys are the same, the value will be unchanged
                # as it's always the file id for this entry.
                if old_key != new_key:
                    # Transform a change into explicit delete/add preserving
                    # a possible match on the key from a different file id.
                    if old_key is not None:
                        parent_id_basename_delta.setdefault(old_key, [None, None])[
                            0
                        ] = old_key
                    if new_key is not None:
                        parent_id_basename_delta.setdefault(new_key, [None, None])[
                            1
                        ] = new_value
        # validate that deletes are complete.
        for file_id in deletes:
            entry = self.get_entry(file_id)
            if entry.kind != "directory":
                continue
            # This loop could potentially be better by using the id_basename
            # map to just get the child file ids.
            for child in self.iter_sorted_children(entry.file_id):
                if child.file_id not in altered:
                    raise InconsistentDelta(
                        self.id2path(child.file_id),
                        child.file_id,
                        "Child not deleted or reparented when parent deleted.",
                    )
        result.id_to_entry.apply_delta(id_to_entry_delta)
        if parent_id_basename_delta:
            # Transform the parent_id_basename delta data into a linear delta
            # with only one record for a given key. Optimally this would allow
            # re-keying, but it's simpler to just output that as a delete+add
            # to spend less time calculating the delta.
            delta_list = []
            for key, (old_key, value) in parent_id_basename_delta.items():
                if value is not None:
                    delta_list.append((old_key, key, value))
                else:
                    delta_list.append((old_key, None, None))
            result.parent_id_basename_to_file_id.apply_delta(delta_list)
        parents.discard(("", None))
        for parent_path, parent in parents:
            try:
                if result.get_entry(parent).kind != "directory":
                    raise InconsistentDelta(
                        result.id2path(parent),
                        parent,
                        "Not a directory, but given children",
                    )
            except NoSuchId as e:
                raise InconsistentDelta(
                    "", parent, "Parent is not present in resulting inventory."
                ) from e
            if result.path2id(parent_path) != parent:
                raise InconsistentDelta(
                    parent_path,
                    parent,
                    f"Parent has wrong path {result.path2id(parent_path)!r}.",
                )
        return result

    @classmethod
    def deserialise(klass, chk_store, lines, expected_revision_id):
        """Deserialise a CHKInventory.

        :param chk_store: A CHK capable VersionedFiles instance.
        :param lines: The serialised inventory, as a list of byte lines.
        :param expected_revision_id: The revision ID we think this inventory is
            for.
        :return: A CHKInventory
        """
        if not lines[-1].endswith(b"\n"):
            raise ValueError("last line should have trailing eol\n")
        if lines[0] != b"chkinventory:\n":
            raise ValueError(f"not a serialised CHKInventory: {lines!r}")
        info = {}
        allowed_keys = frozenset(
            (
                b"root_id",
                b"revision_id",
                b"parent_id_basename_to_file_id",
                b"search_key_name",
                b"id_to_entry",
            )
        )
        for line in lines[1:]:
            key, value = line.rstrip(b"\n").split(b": ", 1)
            if key not in allowed_keys:
                raise BzrFormatsError(f"Unknown key in inventory: {key!r}\n{lines!r}")
            if key in info:
                raise BzrFormatsError(f"Duplicate key in inventory: {key!r}\n{lines!r}")
            info[key] = value
        revision_id = info[b"revision_id"]
        root_id = info[b"root_id"]
        search_key_name = info.get(b"search_key_name", b"plain")
        parent_id_basename_to_file_id = info.get(b"parent_id_basename_to_file_id")
        # The key is optional in the serialised form, so only validate it when
        # it is present; a missing key is handled as None further down.
        if (
            parent_id_basename_to_file_id is not None
            and not parent_id_basename_to_file_id.startswith(b"sha1:")
        ):
            raise ValueError(
                "parent_id_basename_to_file_id should be a sha1"
                f" key not {parent_id_basename_to_file_id!r}"
            )
        id_to_entry = info[b"id_to_entry"]
        if not id_to_entry.startswith(b"sha1:"):
            raise ValueError(f"id_to_entry should be a sha1 key not {id_to_entry!r}")

        result = CHKInventory(search_key_name)
        result.revision_id = revision_id
        result.root_id = root_id
        from .
import chk_map

        search_key_func = chk_map.search_key_registry.get(result._search_key_name)
        if parent_id_basename_to_file_id is not None:
            result.parent_id_basename_to_file_id = chk_map.CHKMap(
                chk_store,
                (parent_id_basename_to_file_id,),
                search_key_func=search_key_func,
            )
        else:
            result.parent_id_basename_to_file_id = None

        result.id_to_entry = chk_map.CHKMap(
            chk_store,
            (id_to_entry,),
            search_key_func=search_key_func,
        )
        if (result.revision_id,) != expected_revision_id:
            raise ValueError(
                f"Mismatched revision id and expected: {result.revision_id!r}, {expected_revision_id!r}"
            )
        return result

    @classmethod
    def from_inventory(
        klass, chk_store, inventory, maximum_size=0, search_key_name=b"plain"
    ):
        """Create a CHKInventory from an existing inventory.

        The content of inventory is copied into the chk_store, and a
        CHKInventory referencing that is returned.

        :param chk_store: A CHK capable VersionedFiles instance.
        :param inventory: The inventory to copy.
        :param maximum_size: The CHKMap node size limit.
        :param search_key_name: The identifier for the search key function
        """
        result = klass(search_key_name)
        result.revision_id = inventory.revision_id
        result.root_id = inventory.root.file_id

        parent_id_basename_key = result._parent_id_basename_key
        id_to_entry_dict = {}
        parent_id_basename_dict = {}
        for _path, entry in inventory.iter_entries():
            key = (entry.file_id,)
            id_to_entry_dict[key] = _chk_inventory_entry_to_bytes(entry)
            p_id_key = parent_id_basename_key(entry)
            parent_id_basename_dict[p_id_key] = entry.file_id

        result._populate_from_dicts(
            chk_store,
            id_to_entry_dict,
            parent_id_basename_dict,
            maximum_size=maximum_size,
        )
        return result

    def _populate_from_dicts(
        self, chk_store, id_to_entry_dict, parent_id_basename_dict, maximum_size
    ):
        from .
import chk_map

        search_key_func = chk_map.search_key_registry.get(self._search_key_name)
        root_key = chk_map.CHKMap.from_dict(
            chk_store,
            id_to_entry_dict,
            maximum_size=maximum_size,
            key_width=1,
            search_key_func=search_key_func,
        )
        self.id_to_entry = chk_map.CHKMap(chk_store, root_key, search_key_func)
        root_key = chk_map.CHKMap.from_dict(
            chk_store,
            parent_id_basename_dict,
            maximum_size=maximum_size,
            key_width=2,
            search_key_func=search_key_func,
        )
        self.parent_id_basename_to_file_id = chk_map.CHKMap(
            chk_store, root_key, search_key_func
        )

    def _parent_id_basename_key(self, entry):
        """Create a key for an entry in a parent_id_basename_to_file_id index."""
        parent_id = entry.parent_id if entry.parent_id is not None else b""
        return (parent_id, entry.name.encode("utf8"))

    def get_entry(self, file_id):
        """Map a single file_id -> InventoryEntry."""
        if file_id is None:
            raise NoSuchId(self, file_id)
        result = self._fileid_to_entry_cache.get(file_id, None)
        if result is not None:
            return result
        try:
            return self._bytes_to_entry(
                next(self.id_to_entry.iteritems([(file_id,)]))[1]
            )
        except StopIteration as e:
            # really we're passing an inventory, not a tree...
            raise NoSuchId(self, file_id) from e

    def _getitems(self, file_ids: Iterable[FileId]) -> list[InventoryEntry]:  # type: ignore
        """Similar to get_entry, but lets you query for multiple.

        The returned order is undefined. And currently if an item doesn't
        exist, it isn't included in the output.
        """
        result: list[InventoryEntry] = []  # type: ignore
        remaining: list[FileId] = []
        for file_id in file_ids:
            entry = self._fileid_to_entry_cache.get(file_id, None)
            if entry is None:
                remaining.append(file_id)
            else:
                result.append(entry)
        file_keys = [(f,) for f in remaining]
        for _file_key, value in self.id_to_entry.iteritems(file_keys):
            entry = self._bytes_to_entry(value)
            result.append(entry)
            self._fileid_to_entry_cache[entry.file_id] = entry
        return result

    def has_id(self, file_id):
        """Check if a file_id exists in the inventory.
        Args:
            file_id: The file ID to check.

        Returns:
            True if the file_id exists, False otherwise.
        """
        # Perhaps have an explicit 'contains' method on CHKMap ?
        if self._fileid_to_entry_cache.get(file_id, None) is not None:
            return True
        return len(list(self.id_to_entry.iteritems([(file_id,)]))) == 1

    def is_root(self, file_id):
        """Check if a file_id is the root of the inventory.

        Args:
            file_id: The file ID to check.

        Returns:
            True if this is the root file ID, False otherwise.
        """
        return file_id == self.root_id

    def _iter_file_id_parents(self, file_id):
        """Yield the parents of file_id up to the root."""
        while file_id is not None:
            try:
                ie = self.get_entry(file_id)
            except KeyError as e:
                raise NoSuchId(tree=self, file_id=file_id) from e
            yield ie
            file_id = ie.parent_id

    def iter_all_ids(self):
        """Iterate over all file-ids."""
        for key, _ in self.id_to_entry.iteritems():
            yield key[-1]

    def iter_just_entries(self):
        """Iterate over all entries.

        Unlike iter_entries(), just the entries are returned (not (path, ie))
        and the order of entries is undefined.

        XXX: We may not want to merge this into bzr.dev.
        """
        for key, entry in self.id_to_entry.iteritems():
            file_id = key[0]
            ie = self._fileid_to_entry_cache.get(file_id, None)
            if ie is None:
                ie = self._bytes_to_entry(entry)
                self._fileid_to_entry_cache[file_id] = ie
            yield ie

    def _preload_cache(self):
        """Make sure all file-ids are in _fileid_to_entry_cache."""
        if self._fully_cached:
            return  # No need to do it again
        # The optimal sort order is to use iteritems() directly
        cache = self._fileid_to_entry_cache
        for key, entry in self.id_to_entry.iteritems():
            file_id = key[0]
            if file_id not in cache:
                ie = self._bytes_to_entry(entry)
                cache[file_id] = ie
            else:
                ie = cache[file_id]
        last_parent_id = last_parent_ie = None
        pid_items = self.parent_id_basename_to_file_id.iteritems()
        for key, child_file_id in pid_items:
            if key == (b"", b""):  # This is the root
                if child_file_id != self.root_id:
                    raise ValueError(
                        "Data inconsistency detected."
                        ' We expected data with key ("","") to match'
                        f" the root id, but {child_file_id} != {self.root_id}"
                    )
                continue
            parent_id, basename = key
            ie = cache[child_file_id]
            if parent_id == last_parent_id:
                if last_parent_ie is None:
                    raise AssertionError("last_parent_ie should not be None")
                parent_ie = last_parent_ie
            else:
                parent_ie = cache[parent_id]
            if parent_ie.kind != "directory":
                raise ValueError(
                    "Data inconsistency detected."
                    " An entry in the parent_id_basename_to_file_id map"
                    f" has parent_id {{{parent_id}}} but the kind of that object"
                    f' is {parent_ie.kind!r} not "directory"'
                )
            siblings = self._children_cache.setdefault(parent_ie.file_id, {})
            basename = basename.decode("utf-8")
            if basename in siblings:
                existing_ie = siblings[basename]
                if existing_ie != ie:
                    raise ValueError(
                        "Data inconsistency detected."
                        f" Two entries with basename {basename!r} were found"
                        f" in the parent entry {{{parent_id}}}"
                    )
            if basename != ie.name:
                raise ValueError(
                    "Data inconsistency detected."
                    " In the parent_id_basename_to_file_id map, file_id"
                    " {{{}}} is listed as having basename {!r}, but in the"
                    " id_to_entry map it is {!r}".format(
                        child_file_id, basename, ie.name
                    )
                )
            siblings[basename] = ie
        self._fully_cached = True

    def iter_changes(self, basis):
        """Generate a Tree.iter_changes change list between this and basis.

        :param basis: Another CHKInventory.
        :return: An iterator over the changes between self and basis, as per
            tree.iter_changes().
        """
        # We want: (file_id, (path_in_source, path_in_target),
        # changed_content, versioned, parent, name, kind,
        # executable)
        for key, basis_value, self_value in self.id_to_entry.iter_changes(
            basis.id_to_entry
        ):
            file_id = key[0]
            if basis_value is not None:
                basis_entry = basis._bytes_to_entry(basis_value)
                path_in_source = basis.id2path(file_id)
                basis_parent = basis_entry.parent_id
                basis_name = basis_entry.name
                basis_executable = basis_entry.executable
            else:
                path_in_source = None
                basis_parent = None
                basis_name = None
                basis_executable = None
            if self_value is not None:
                self_entry = self._bytes_to_entry(self_value)
                path_in_target = self.id2path(file_id)
                self_parent = self_entry.parent_id
                self_name = self_entry.name
                self_executable = self_entry.executable
            else:
                path_in_target = None
                self_parent = None
                self_name = None
                self_executable = None
            if basis_value is None:
                # add
                kind = (None, self_entry.kind)
                versioned = (False, True)
            elif self_value is None:
                # delete
                kind = (basis_entry.kind, None)
                versioned = (True, False)
            else:
                kind = (basis_entry.kind, self_entry.kind)
                versioned = (True, True)
            changed_content = False
            if kind[0] != kind[1]:
                changed_content = True
            elif kind[0] == "file":
                if (
                    self_entry.text_size != basis_entry.text_size
                    or self_entry.text_sha1 != basis_entry.text_sha1
                ):
                    changed_content = True
            elif kind[0] == "symlink":
                if self_entry.symlink_target != basis_entry.symlink_target:
                    changed_content = True
            elif kind[0] == "tree-reference":
                if self_entry.reference_revision != basis_entry.reference_revision:
                    changed_content = True
            parent = (basis_parent, self_parent)
            name = (basis_name, self_name)
            executable = (basis_executable, self_executable)
            if (
                not changed_content
                and parent[0] == parent[1]
                and name[0] == name[1]
                and executable[0] == executable[1]
            ):
                # Could happen when only the revision changed for a directory
                # for instance.
                continue
            yield (
                file_id,
                (path_in_source, path_in_target),
                changed_content,
                versioned,
                parent,
                name,
                kind,
                executable,
            )

    def __len__(self) -> int:
        """Return the number of entries in the inventory."""
        return len(self.id_to_entry)

    def path2id(self, relpath):
        """Return the file ID for a relative path.

        Args:
            relpath: Relative path as string or list of path components.

        Returns:
            The file ID for the path, or None if not found.
        """
        # TODO: perhaps support negative hits?
        if isinstance(relpath, str):
            names = osutils.splitpath(relpath)
        else:
            names = relpath
            if relpath == []:
                relpath = [""]
            relpath = osutils.pathjoin(*relpath)
        result = self._path_to_fileid_cache.get(relpath, None)
        if result is not None:
            return result
        current_id = self.root_id
        if current_id is None:
            return None
        parent_id_index = self.parent_id_basename_to_file_id
        cur_path = None
        for basename in names:
            cur_path = basename if cur_path is None else cur_path + "/" + basename
            basename_utf8 = basename.encode("utf8")
            file_id = self._path_to_fileid_cache.get(cur_path, None)
            if file_id is None:
                key_filter = [(current_id, basename_utf8)]
                items = parent_id_index.iteritems(key_filter)
                for (parent_id, name_utf8), file_id in items:  # noqa: B007
                    if parent_id != current_id or name_utf8 != basename_utf8:
                        raise BzrFormatsError(
                            "corrupt inventory lookup!
{!r} {!r} {!r} {!r}".format(
                                parent_id, current_id, name_utf8, basename_utf8
                            )
                        )
                if file_id is None:
                    return None
                else:
                    self._path_to_fileid_cache[cur_path] = file_id
            current_id = file_id
        return current_id

    def to_lines(self):
        """Serialise the inventory to lines."""
        lines = [b"chkinventory:\n"]
        if self._search_key_name != b"plain":
            # custom ordering grouping things that don't change together
            lines.append(b"search_key_name: %s\n" % (self._search_key_name))
            lines.append(b"root_id: %s\n" % self.root_id)
            lines.append(
                b"parent_id_basename_to_file_id: %s\n"
                % (self.parent_id_basename_to_file_id.key()[0],)
            )
            lines.append(b"revision_id: %s\n" % self.revision_id)
            lines.append(b"id_to_entry: %s\n" % (self.id_to_entry.key()[0],))
        else:
            lines.append(b"revision_id: %s\n" % self.revision_id)
            lines.append(b"root_id: %s\n" % self.root_id)
            if self.parent_id_basename_to_file_id is not None:
                lines.append(
                    b"parent_id_basename_to_file_id: %s\n"
                    % (self.parent_id_basename_to_file_id.key()[0],)
                )
            lines.append(b"id_to_entry: %s\n" % (self.id_to_entry.key()[0],))
        return lines

    @property
    def root(self):
        """Get the root entry."""
        return self.get_entry(self.root_id)


entry_factory = {
    "directory": InventoryDirectory,
    "file": InventoryFile,
    "symlink": InventoryLink,
    "tree-reference": TreeReference,
}


def make_entry(kind, name, parent_id, file_id=None, revision=None, **kwargs):
    """Create an inventory entry.

    :param kind: the type of inventory entry to create.
    :param name: the basename of the entry.
    :param parent_id: the parent_id of the entry.
    :param file_id: the file_id to use. if None, one will be created.
    """
    if file_id is None:
        from .
import generate_ids

        file_id = generate_ids.gen_file_id(name)
    name = ensure_normalized_name(name)
    try:
        factory = entry_factory[kind]
    except KeyError as e:
        raise BadFileKindError(name, kind) from e
    return factory(file_id, name, parent_id, revision, **kwargs)


ensure_normalized_name = _mod_inventory_rs.ensure_normalized_name
is_valid_name = _mod_inventory_rs.is_valid_name


def mutable_inventory_from_tree(tree):
    """Create a new inventory that has the same contents as a specified tree.

    :param tree: Revision tree to create inventory from
    """
    entries = tree.iter_entries_by_dir()
    inv = Inventory(None, tree.get_revision_id())
    for _path, inv_entry in entries:
        inv.add(inv_entry.copy())
    return inv


chk_inventory_bytes_to_utf8name_key = (
    _mod_inventory_rs.chk_inventory_bytes_to_utf8name_key
)
_chk_inventory_bytes_to_entry = _mod_inventory_rs.chk_inventory_bytes_to_entry
_chk_inventory_entry_to_bytes = _mod_inventory_rs.chk_inventory_entry_to_bytes


def _make_delta(new, old):
    """Make an inventory delta from two inventories."""
    from .inventory_delta import InventoryDelta

    if isinstance(old, CHKInventory) and isinstance(new, CHKInventory):
        delta = []
        for key, old_value, self_value in new.id_to_entry.iter_changes(old.id_to_entry):
            file_id = key[0]
            old_path = old.id2path(file_id) if old_value is not None else None
            if self_value is not None:
                entry = new._bytes_to_entry(self_value)
                new._fileid_to_entry_cache[file_id] = entry
                new_path = new.id2path(file_id)
            else:
                entry = None
                new_path = None
            delta.append((old_path, new_path, file_id, entry))
        return InventoryDelta(delta)
    elif isinstance(old, Inventory) and isinstance(new, Inventory):
        return new._make_delta(old)
    else:
        old_ids = set(old.iter_all_ids())
        new_ids = set(new.iter_all_ids())
        adds = new_ids - old_ids
        deletes = old_ids - new_ids
        common = old_ids.intersection(new_ids)
        delta = []
        for file_id in deletes:
            delta.append((old.id2path(file_id), None, file_id, None))
        for file_id in adds:
            delta.append((None, new.id2path(file_id), file_id,
new.get_entry(file_id)))
        for file_id in common:
            if old.get_entry(file_id) != new.get_entry(file_id):
                delta.append(
                    (
                        old.id2path(file_id),
                        new.id2path(file_id),
                        file_id,
                        new.get_entry(file_id),
                    )
                )
        return InventoryDelta(delta)
bzrformats_3.4.0.orig/bzrformats/inventory_delta.py0000644000000000000000000001002015162115103017613 0ustar00
# Copyright (C) 2008, 2009 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

"""Inventory delta serialisation.

See doc/developers/inventory.txt for the description of the format.

In this module the interesting classes are:
 - InventoryDeltaSerializer - object to read/write inventory deltas.
"""

__all__ = ["InventoryDeltaSerializer"]

from ._bzr_rs import inventory as _inventory_delta_rs
from .revision import RevisionID

InventoryDeltaError = _inventory_delta_rs.InventoryDeltaError
IncompatibleInventoryDelta = _inventory_delta_rs.IncompatibleInventoryDelta
parse_inventory_entry = _inventory_delta_rs.parse_inventory_entry
serialize_inventory_entry = _inventory_delta_rs.serialize_inventory_entry
InventoryDelta = _inventory_delta_rs.InventoryDelta


class InventoryDeltaSerializer:
    """Serialize inventory deltas."""

    def __init__(self, versioned_root, tree_references):
        """Create an InventoryDeltaSerializer.
        :param versioned_root: If True, any root entry that is seen is expected
            to be versioned, and root entries can have any fileid.
        :param tree_references: If True support tree-reference entries.
        """
        self._versioned_root = versioned_root
        self._tree_references = tree_references

    def delta_to_lines(
        self,
        old_name: RevisionID,
        new_name: RevisionID,
        delta_to_new: _inventory_delta_rs.InventoryDelta,
    ):
        """Return a line sequence for delta_to_new.

        The versioned_root and tree_references flags given to the constructor
        are used for the serialisation.

        :param old_name: A UTF8 revision id for the old inventory.  May be
            NULL_REVISION if there is no older inventory and delta_to_new
            includes the entire inventory contents.
        :param new_name: The version name of the inventory we create with this
            delta.
        :param delta_to_new: An inventory delta such as Inventory.apply_delta
            takes.
        :return: The serialized delta as lines.
        """
        return _inventory_delta_rs.serialize_inventory_delta(
            old_name,
            new_name,
            delta_to_new,
            self._versioned_root,
            self._tree_references,
        )


class InventoryDeltaDeserializer:
    """Deserialize inventory deltas."""

    def __init__(self, allow_versioned_root=True, allow_tree_references=True):
        """Create an InventoryDeltaDeserializer.

        :param allow_versioned_root: If True, any root entry that is seen is
            expected to be versioned, and root entries can have any fileid.
        :param allow_tree_references: If True support tree-reference entries.
        """
        self._allow_versioned_root = allow_versioned_root
        self._allow_tree_references = allow_tree_references

    def parse_text_bytes(self, lines):
        """Parse the text bytes of a serialized inventory delta.

        The allow_versioned_root and allow_tree_references flags given to the
        constructor are passed on to the parser, which checks the flags
        recorded in the delta against them.

        :param lines: The lines to parse. This can be obtained by calling
            delta_to_lines.
:return: (parent_id, new_id, versioned_root, tree_references, inventory_delta) """ return _inventory_delta_rs.parse_inventory_delta( lines, self._allow_versioned_root, self._allow_tree_references ) bzrformats_3.4.0.orig/bzrformats/knit.py0000644000000000000000000047403215162115103015373 0ustar00# Copyright (C) 2006-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Knit versionedfile implementation. A knit is a versioned file implementation that supports efficient append-only updates. Knit file layout: lifeless: the data file is made up of "delta records". each delta record has a delta header that contains; (1) a version id, (2) the size of the delta (in lines), and (3) the digest of the -expanded data- (ie, the delta applied to the parent). the delta also ends with an end-marker; simply "end VERSION" delta can be line or full contents. ... the 8's there are the index number of the annotation.
version robertc@robertcollins.net-20051003014215-ee2990904cc4c7ad 7 c7d23b2a5bd6ca00e8e266cec0ec228158ee9f9e 59,59,3 8 8 if ie.executable: 8 e.set('executable', 'yes') 130,130,2 8 if elt.get('executable') == 'yes': 8 ie.executable = True end robertc@robertcollins.net-20051003014215-ee2990904cc4c7ad whats in an index: 09:33 < jrydberg> lifeless: each index is made up of a tuple of; version id, options, position, size, parents 09:33 < jrydberg> lifeless: the parents are currently dictionary compressed 09:33 < jrydberg> lifeless: (meaning it currently does not support ghosts) 09:33 < lifeless> right 09:33 < jrydberg> lifeless: the position and size is the range in the data file so the index sequence is the dictionary compressed sequence number used in the deltas to provide line annotation """ import contextlib import gzip import logging import operator import os from io import BytesIO from vcsgraph import tsort from bzrformats import pack from . import diff, osutils, pack_repo, tuned_gzip from . import index as _mod_index from .annotate import VersionedFileAnnotator from .errors import ( BzrFormatsError, InvalidRevisionId, NoSuchFile, ObjectNotLocked, ReadOnlyError, ReadOnlyObjectDirtiedError, RevisionAlreadyPresent, RevisionNotPresent, ) from .osutils import contains_whitespace, sha_string, sha_strings from .transport import TransportNoSuchFile from .versionedfile import ( AbsentContentFactory, ConstantMapper, ContentFactory, ExistingContent, UnavailableRepresentation, VersionedFilesWithFallbacks, _KeyRefs, adapter_registry, sort_groupcompress, ) evil_logger = logging.getLogger("bzrformats.evil") logger = logging.getLogger("bzrformats.knit") # TODO: Split out code specific to this format into an associated object. # TODO: Can we put in some kind of value to check that the index and data # files belong together? 
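# Illustrative sketch only, not part of the upstream knit module. The module
# docstring above shows delta hunks such as "59,59,3" followed by three lines.
# Assuming each hunk header means "start,end,count" (replace basis lines
# [start:end] with the next `count` lines), the hypothetical helpers below
# parse an unannotated line delta and apply it to a basis text. The names
# _sketch_parse_plain_delta and _sketch_apply_delta are made up for this
# example; the real logic lives in the factory and content classes below.


def _sketch_parse_plain_delta(delta_lines):
    """Parse [b"start,end,count\n", line, ...] into (start, end, count, lines)."""
    hunks = []
    cur = 0
    while cur < len(delta_lines):
        # int() tolerates the trailing newline on the last header field.
        start, end, count = (int(n) for n in delta_lines[cur].split(b","))
        cur += 1
        hunks.append((start, end, count, delta_lines[cur : cur + count]))
        cur += count
    return hunks


def _sketch_apply_delta(basis_lines, hunks):
    """Return a new list with each hunk applied; positions refer to the basis."""
    lines = list(basis_lines)
    offset = 0
    for start, end, count, new_lines in hunks:
        lines[offset + start : offset + end] = new_lines
        # Later hunks address the original basis, so track the size drift.
        offset += count - (end - start)
    return lines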
# TODO: accommodate binaries, perhaps by storing a byte count # TODO: function to check whole file # TODO: atomically append data, then measure backwards from the cursor # position after writing to work out where it was located. we may need to # bypass python file buffering. DATA_SUFFIX = ".knit" INDEX_SUFFIX = ".kndx" _STREAM_MIN_BUFFER_SIZE = 5 * 1024 * 1024 class KnitError(BzrFormatsError): """Base exception for errors related to knit file operations.""" _fmt = "Knit error" class KnitCorrupt(KnitError): """Raised when a knit file is found to be corrupt.""" _fmt = "Knit %(filename)s corrupt: %(how)s" def __init__(self, filename, how): """Initialize KnitCorrupt exception. Args: filename: The path to the corrupt knit file. how: Description of how the file is corrupt. """ KnitError.__init__(self) self.filename = filename self.how = how class SHA1KnitCorrupt(KnitCorrupt): """Raised when SHA-1 checksum validation fails for knit content.""" _fmt = ( "Knit %(filename)s corrupt: sha-1 of reconstructed text does not " "match expected sha-1. key %(key)s expected sha %(expected)s actual " "sha %(actual)s" ) def __init__(self, filename, actual, expected, key, content): """Initialize SHA1KnitCorrupt exception. Args: filename: The path to the corrupt knit file. actual: The actual SHA-1 hash computed. expected: The expected SHA-1 hash. key: The key of the corrupt content. content: The content that failed validation. """ KnitError.__init__(self) self.filename = filename self.actual = actual self.expected = expected self.key = key self.content = content class KnitDataStreamIncompatible(KnitError): """Raised when attempting to insert incompatible knit data streams. Not raised anymore, as we can convert data streams. In future we may need it again for more exotic cases, so we're keeping it around for now. """ _fmt = 'Cannot insert knit data stream of format "%(stream_format)s" into knit of format "%(target_format)s".' 
def __init__(self, stream_format, target_format): """Initialize KnitDataStreamIncompatible exception. Args: stream_format: The format of the data stream being inserted. target_format: The format of the target knit. """ self.stream_format = stream_format self.target_format = target_format class KnitDataStreamUnknown(KnitError): """Raised when encountering an unknown knit data stream format. Indicates a data stream we don't know how to handle. """ _fmt = 'Cannot parse knit data stream of format "%(stream_format)s".' def __init__(self, stream_format): """Initialize KnitDataStreamUnknown exception. Args: stream_format: The unknown format of the data stream. """ self.stream_format = stream_format class KnitHeaderError(KnitError): """Raised when a knit file header is malformed or unexpected.""" _fmt = 'Knit header error: %(badline)r unexpected for file "%(filename)s".' def __init__(self, badline, filename): """Initialize KnitHeaderError exception. Args: badline: The malformed header line. filename: The path to the knit file with the bad header. """ KnitError.__init__(self) self.badline = badline self.filename = filename class KnitIndexUnknownMethod(KnitError): """Raised when we don't understand the storage method. Currently only 'fulltext' and 'line-delta' are supported. """ _fmt = ( "Knit index %(filename)s does not have a known method in options: %(options)r" ) def __init__(self, filename, options): """Initialize KnitIndexUnknownMethod exception. Args: filename: The path to the knit index file. options: The unknown options/methods found in the index. """ KnitError.__init__(self) self.filename = filename self.options = options class KnitAdapter: """Base class for knit record adaption.""" def __init__(self, basis_vf): """Create an adapter which accesses full texts from basis_vf. :param basis_vf: A versioned file to access basis texts of deltas from. May be None for adapters that do not need to access basis texts. 
""" self._data = KnitVersionedFiles(None, None) self._annotate_factory = KnitAnnotateFactory() self._plain_factory = KnitPlainFactory() self._basis_vf = basis_vf class FTAnnotatedToUnannotated(KnitAdapter): """An adapter from FT annotated knits to unannotated ones.""" def get_bytes(self, factory, target_storage_kind): """Convert annotated fulltext knit records to unannotated format. Args: factory: The record factory containing the raw knit data. target_storage_kind: The desired storage format for the output. Returns: The converted unannotated knit data as bytes. Raises: UnavailableRepresentation: If target format is not 'knit-ft-gz'. """ if target_storage_kind != "knit-ft-gz": raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) annotated_compressed_bytes = factory._raw_record rec, contents = self._data._parse_record_unchecked(annotated_compressed_bytes) content = self._annotate_factory.parse_fulltext(contents, rec[1]) _size, chunks = self._data._record_to_data((rec[1],), rec[3], content.text()) return b"".join(chunks) class DeltaAnnotatedToUnannotated(KnitAdapter): """An adapter for deltas from annotated to unannotated.""" def get_bytes(self, factory, target_storage_kind): """Convert annotated delta knit records to unannotated format. Args: factory: The record factory containing the raw knit data. target_storage_kind: The desired storage format for the output. Returns: The converted unannotated delta data as bytes. Raises: UnavailableRepresentation: If target format is not 'knit-delta-gz'. 
""" if target_storage_kind != "knit-delta-gz": raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) annotated_compressed_bytes = factory._raw_record rec, contents = self._data._parse_record_unchecked(annotated_compressed_bytes) delta = self._annotate_factory.parse_line_delta(contents, rec[1], plain=True) contents = self._plain_factory.lower_line_delta(delta) _size, chunks = self._data._record_to_data((rec[1],), rec[3], contents) return b"".join(chunks) class FTAnnotatedToFullText(KnitAdapter): """An adapter from FT annotated knits to unannotated ones.""" def get_bytes(self, factory, target_storage_kind): """Convert annotated fulltext knit records to plain fulltext. Args: factory: The record factory containing the raw knit data. target_storage_kind: The desired storage format ('fulltext', 'chunked', or 'lines'). Returns: The converted fulltext data in the requested format. Raises: UnavailableRepresentation: If target format is not supported. """ annotated_compressed_bytes = factory._raw_record _rec, contents = self._data._parse_record_unchecked(annotated_compressed_bytes) content, _delta = self._annotate_factory.parse_record( factory.key[-1], contents, factory._build_details, None ) if target_storage_kind == "fulltext": return b"".join(content.text()) elif target_storage_kind in ("chunked", "lines"): return content.text() raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) class DeltaAnnotatedToFullText(KnitAdapter): """An adapter for deltas from annotated to unannotated.""" def get_bytes(self, factory, target_storage_kind): """Apply annotated delta to basis text and return fulltext. Args: factory: The record factory containing the raw delta data. target_storage_kind: The desired storage format ('fulltext', 'chunked', or 'lines'). Returns: The reconstructed fulltext data in the requested format. Raises: RevisionNotPresent: If the compression parent is not available. 
UnavailableRepresentation: If target format is not supported. """ annotated_compressed_bytes = factory._raw_record rec, contents = self._data._parse_record_unchecked(annotated_compressed_bytes) delta = self._annotate_factory.parse_line_delta(contents, rec[1], plain=True) compression_parent = factory.parents[0] basis_entry = next( self._basis_vf.get_record_stream([compression_parent], "unordered", True) ) if basis_entry.storage_kind == "absent": raise RevisionNotPresent(compression_parent, self._basis_vf) basis_lines = basis_entry.get_bytes_as("lines") # Manually apply the delta because we have one annotated content and # one plain. basis_content = PlainKnitContent(basis_lines, compression_parent) basis_content.apply_delta(delta, rec[1]) basis_content._should_strip_eol = factory._build_details[1] if target_storage_kind == "fulltext": return b"".join(basis_content.text()) elif target_storage_kind in ("chunked", "lines"): return basis_content.text() raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) class FTPlainToFullText(KnitAdapter): """An adapter from plain (unannotated) FT knits to full texts.""" def get_bytes(self, factory, target_storage_kind): """Convert plain fulltext knit records to fulltext format. Args: factory: The record factory containing the raw knit data. target_storage_kind: The desired storage format ('fulltext', 'chunked', or 'lines'). Returns: The fulltext data in the requested format. Raises: UnavailableRepresentation: If target format is not supported.
""" compressed_bytes = factory._raw_record _rec, contents = self._data._parse_record_unchecked(compressed_bytes) content, _delta = self._plain_factory.parse_record( factory.key[-1], contents, factory._build_details, None ) if target_storage_kind == "fulltext": return b"".join(content.text()) elif target_storage_kind in ("chunked", "lines"): return content.text() raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) class DeltaPlainToFullText(KnitAdapter): """An adapter for deltas from annotated to unannotated.""" def get_bytes(self, factory, target_storage_kind): """Apply plain delta to basis text and return fulltext. Args: factory: The record factory containing the raw delta data. target_storage_kind: The desired storage format ('fulltext', 'chunked', or 'lines'). Returns: The reconstructed fulltext data in the requested format. Raises: RevisionNotPresent: If the compression parent is not available. UnavailableRepresentation: If target format is not supported. """ compressed_bytes = factory._raw_record rec, contents = self._data._parse_record_unchecked(compressed_bytes) self._plain_factory.parse_line_delta(contents, rec[1]) compression_parent = factory.parents[0] # XXX: string splitting overhead. basis_entry = next( self._basis_vf.get_record_stream([compression_parent], "unordered", True) ) if basis_entry.storage_kind == "absent": raise RevisionNotPresent(compression_parent, self._basis_vf) basis_lines = basis_entry.get_bytes_as("lines") basis_content = PlainKnitContent(basis_lines, compression_parent) # Manually apply the delta because we have one annotated content and # one plain. 
content, _ = self._plain_factory.parse_record( rec[1], contents, factory._build_details, basis_content ) if target_storage_kind == "fulltext": return b"".join(content.text()) elif target_storage_kind in ("chunked", "lines"): return content.text() raise UnavailableRepresentation( factory.key, target_storage_kind, factory.storage_kind ) class KnitContentFactory(ContentFactory): """Content factory for streaming from knits. :seealso ContentFactory: """ def __init__( self, key, parents, build_details, sha1, raw_record, annotated, knit=None, network_bytes=None, ): """Create a KnitContentFactory for key. :param key: The key. :param parents: The parents. :param build_details: The build details as returned from get_build_details. :param sha1: The sha1 expected from the full text of this object. :param raw_record: The bytes of the knit data from disk. :param annotated: True if the raw data is annotated. :param network_bytes: None to calculate the network bytes on demand, not-none if they are already known. 
""" ContentFactory.__init__(self) self.sha1 = sha1 self.key = key self.parents = parents kind = "delta" if build_details[0] == "line-delta" else "ft" annotated_kind = "annotated-" if annotated else "" self.storage_kind = f"knit-{annotated_kind}{kind}-gz" self._raw_record = raw_record self._network_bytes = network_bytes self._build_details = build_details self._knit = knit def _create_network_bytes(self): """Create a fully serialised network version for transmission.""" # storage_kind, key, parents, Noeol, raw_record key_bytes = b"\x00".join(self.key) if self.parents is None: parent_bytes = b"None:" else: parent_bytes = b"\t".join(b"\x00".join(key) for key in self.parents) noeol = b"N" if self._build_details[1] else b" " network_bytes = b"%s\n%s\n%s\n%s%s" % ( self.storage_kind.encode("ascii"), key_bytes, parent_bytes, noeol, self._raw_record, ) self._network_bytes = network_bytes def get_bytes_as(self, storage_kind): """Get the bytes for this content in the specified storage format. Args: storage_kind: The desired storage format. Returns: The content bytes in the requested format. Raises: UnavailableRepresentation: If the format is not available. """ if storage_kind == self.storage_kind: if self._network_bytes is None: self._create_network_bytes() return self._network_bytes if "-ft-" in self.storage_kind and storage_kind in ( "chunked", "fulltext", "lines", ): adapter_key = (self.storage_kind, storage_kind) adapter_factory = adapter_registry.get(adapter_key) adapter = adapter_factory(None) return adapter.get_bytes(self, storage_kind) if self._knit is not None: # Not redundant with direct conversion above - that only handles # fulltext cases. 
if storage_kind in ("chunked", "lines"): return self._knit.get_lines(self.key[0]) elif storage_kind == "fulltext": return self._knit.get_text(self.key[0]) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) def iter_bytes_as(self, storage_kind): """Iterate over the bytes for this content in the specified format. Args: storage_kind: The desired storage format. Returns: An iterator over the content bytes. """ return iter(self.get_bytes_as(storage_kind)) class LazyKnitContentFactory(ContentFactory): """A ContentFactory which can either generate full text or a wire form. :seealso ContentFactory: """ def __init__(self, key, parents, generator, first): """Create a LazyKnitContentFactory. :param key: The key of the record. :param parents: The parents of the record. :param generator: A _ContentMapGenerator containing the record for this key. :param first: Is this the first content object returned from generator? if it is, its storage kind is knit-delta-closure, otherwise it is knit-delta-closure-ref """ self.key = key self.parents = parents self.sha1 = None self.size = None self._generator = generator self.storage_kind = "knit-delta-closure" if not first: self.storage_kind = self.storage_kind + "-ref" self._first = first def get_bytes_as(self, storage_kind): """Get the bytes for this lazy content in the specified storage format. Args: storage_kind: The desired storage format. Returns: The content bytes in the requested format. Raises: UnavailableRepresentation: If the format is not available. """ if storage_kind == self.storage_kind: if self._first: return self._generator._wire_bytes() else: # all the keys etc are contained in the bytes returned in the # first record. 
return b"" if storage_kind in ("chunked", "fulltext", "lines"): chunks = self._generator._get_one_work(self.key).text() if storage_kind in ("chunked", "lines"): return chunks else: return b"".join(chunks) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) def iter_bytes_as(self, storage_kind): """Iterate over the bytes for this lazy content in the specified format. Args: storage_kind: The desired storage format. Returns: An iterator over the content chunks. Raises: UnavailableRepresentation: If the format is not available. """ if storage_kind in ("chunked", "lines"): chunks = self._generator._get_one_work(self.key).text() return iter(chunks) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) def knit_delta_closure_to_records(storage_kind, bytes, line_end): """Convert a network record to an iterator over stream records. :param storage_kind: The storage kind of the record. Must be 'knit-delta-closure'. :param bytes: The bytes of the record on the network. """ generator = _NetworkContentMapGenerator(bytes, line_end) return generator.get_record_stream() def knit_network_to_record(storage_kind, bytes, line_end): """Convert a network record to a record object. :param storage_kind: The storage kind of the record. :param bytes: The bytes of the record on the network.
""" start = line_end line_end = bytes.find(b"\n", start) key = tuple(bytes[start:line_end].split(b"\x00")) start = line_end + 1 line_end = bytes.find(b"\n", start) parent_line = bytes[start:line_end] if parent_line == b"None:": parents = None else: parents = tuple( [ tuple(segment.split(b"\x00")) for segment in parent_line.split(b"\t") if segment ] ) start = line_end + 1 noeol = bytes[start : start + 1] == b"N" method = "fulltext" if "ft" in storage_kind else "line-delta" build_details = (method, noeol) start = start + 1 raw_record = bytes[start:] annotated = "annotated" in storage_kind return [ KnitContentFactory( key, parents, build_details, None, raw_record, annotated, network_bytes=bytes, ) ] class KnitContent: r"""Content of a knit version to which deltas can be applied. This is always stored in memory as a list of lines with \n at the end, plus a flag saying if the final ending is really there or not, because that corresponds to the on-disk knit representation. """ def __init__(self): """Initialize KnitContent.""" self._should_strip_eol = False def apply_delta(self, delta, new_version_id): """Apply delta to this object to become new_version_id.""" raise NotImplementedError(self.apply_delta) def line_delta_iter(self, new_lines): """Generate line-based delta from this content to new_lines.""" import patiencediff new_texts = new_lines.text() old_texts = self.text() s = patiencediff.PatienceSequenceMatcher(None, old_texts, new_texts) for tag, i1, i2, j1, j2 in s.get_opcodes(): if tag == "equal": continue # ofrom, oto, length, data yield i1, i2, j2 - j1, new_lines._lines[j1:j2] def line_delta(self, new_lines): """Get the line delta between this content and new_lines. Args: new_lines: The target content to generate a delta to. Returns: A list of delta operations. 
""" return list(self.line_delta_iter(new_lines)) @staticmethod def get_line_delta_blocks(knit_delta, source, target): """Extract SequenceMatcher.get_matching_blocks() from a knit delta.""" target_len = len(target) s_pos = 0 t_pos = 0 for s_begin, s_end, t_len, _new_text in knit_delta: true_n = s_begin - s_pos n = true_n if n > 0: # knit deltas do not provide reliable info about whether the # last line of a file matches, due to eol handling. if source[s_pos + n - 1] != target[t_pos + n - 1]: n -= 1 if n > 0: yield s_pos, t_pos, n t_pos += t_len + true_n s_pos = s_end n = target_len - t_pos if n > 0: if source[s_pos + n - 1] != target[t_pos + n - 1]: n -= 1 if n > 0: yield s_pos, t_pos, n yield s_pos + (target_len - t_pos), target_len, 0 class AnnotatedKnitContent(KnitContent): """Annotated content.""" def __init__(self, lines): """Initialize AnnotatedKnitContent. Args: lines: An iterable of (origin, text) tuples representing annotated lines. """ KnitContent.__init__(self) self._lines = list(lines) def annotate(self): """Return a list of (origin, text) for each content line.""" lines = self._lines[:] if self._should_strip_eol: origin, last_line = lines[-1] lines[-1] = (origin, last_line.rstrip(b"\n")) return lines def apply_delta(self, delta, new_version_id): """Apply delta to this object to become new_version_id.""" offset = 0 lines = self._lines for start, end, count, delta_lines in delta: lines[offset + start : offset + end] = delta_lines offset = offset + (start - end) + count def text(self): """Return the text content without annotations. Returns: A list of text lines. Raises: KnitCorrupt: If annotation information is missing. """ try: lines = [text for origin, text in self._lines] except ValueError as e: # most commonly (only?) 
caused by the internal form of the knit # missing annotation information because of a bug - see thread # around 20071015 raise KnitCorrupt( self, f"line in annotated knit missing annotation information: {e}" ) from e if self._should_strip_eol: lines[-1] = lines[-1].rstrip(b"\n") return lines def copy(self): """Create a copy of this annotated content. Returns: A new AnnotatedKnitContent instance with the same lines. """ return AnnotatedKnitContent(self._lines) class PlainKnitContent(KnitContent): """Unannotated content. When annotate[_iter] is called on this content, the same version is reported for all lines. Generally, annotate[_iter] is not useful on PlainKnitContent objects. """ def __init__(self, lines, version_id): """Initialize PlainKnitContent. Args: lines: A list of text lines. version_id: The version identifier for this content. """ KnitContent.__init__(self) self._lines = lines self._version_id = version_id def annotate(self): """Return a list of (origin, text) for each content line.""" return [(self._version_id, line) for line in self._lines] def apply_delta(self, delta, new_version_id): """Apply delta to this object to become new_version_id.""" offset = 0 lines = self._lines for start, end, count, delta_lines in delta: lines[offset + start : offset + end] = delta_lines offset = offset + (start - end) + count self._version_id = new_version_id def copy(self): """Create a copy of this plain content. Returns: A new PlainKnitContent instance with the same lines and version. """ return PlainKnitContent(self._lines[:], self._version_id) def text(self): """Return the text content. Returns: A list of text lines, possibly with the final EOL stripped. 
""" lines = self._lines if self._should_strip_eol: lines = lines[:] lines[-1] = lines[-1].rstrip(b"\n") return lines class _KnitFactory: """Base class for common Factory functions.""" def parse_record( self, version_id, record, record_details, base_content, copy_base_content=True ): """Parse a record into a full content object. :param version_id: The official version id for this content :param record: The data returned by read_records_iter() :param record_details: Details about the record returned by get_build_details :param base_content: If get_build_details returns a compression_parent, you must return a base_content here, else use None :param copy_base_content: When building from the base_content, decide you can either copy it and return a new object, or modify it in place. :return: (content, delta) A Content object and possibly a line-delta, delta may be None """ method, noeol = record_details if method == "line-delta": content = base_content.copy() if copy_base_content else base_content delta = self.parse_line_delta(record, version_id) content.apply_delta(delta, version_id) else: content = self.parse_fulltext(record, version_id) delta = None content._should_strip_eol = noeol return (content, delta) class KnitAnnotateFactory(_KnitFactory): """Factory for creating annotated Content objects.""" annotated = True def make(self, lines, version_id): """Create an AnnotatedKnitContent from lines and version_id. Args: lines: The text lines to annotate. version_id: The version identifier to assign to all lines. Returns: An AnnotatedKnitContent instance. """ num_lines = len(lines) return AnnotatedKnitContent(zip([version_id] * num_lines, lines, strict=False)) def parse_fulltext(self, content, version_id): r"""Convert fulltext to internal representation. 
fulltext content is of the format revid(utf8) plaintext\n internal representation is of the format: (revid, plaintext) """ # TODO: jam 20070209 The tests expect this to be returned as tuples, # but the code itself doesn't really depend on that. # Figure out a way to not require the overhead of turning the # list back into tuples. lines = (tuple(line.split(b" ", 1)) for line in content) return AnnotatedKnitContent(lines) def parse_line_delta(self, lines, version_id, plain=False): r"""Convert a line based delta into internal representation. line delta is in the form of: intstart intend intcount 1..count lines: revid(utf8) newline\n internal representation is (start, end, count, [1..count tuples (revid, newline)]) :param plain: If True, the lines are returned as a plain list without annotations, not as a list of (origin, content) tuples, i.e. (start, end, count, [1..count newline]) """ result = [] lines = iter(lines) cache = {} def cache_and_return(line): origin, text = line.split(b" ", 1) return cache.setdefault(origin, origin), text # walk through the lines parsing. # Note that the plain test is explicitly pulled out of the # loop to minimise any performance impact if plain: for header in lines: start, end, count = (int(n) for n in header.split(b",")) contents = [next(lines).split(b" ", 1)[1] for _ in range(count)] result.append((start, end, count, contents)) else: for header in lines: start, end, count = (int(n) for n in header.split(b",")) contents = [tuple(next(lines).split(b" ", 1)) for _ in range(count)] result.append((start, end, count, contents)) return result def get_fulltext_content(self, lines): """Extract just the content lines from a fulltext.""" return (line.split(b" ", 1)[1] for line in lines) def get_linedelta_content(self, lines): """Extract just the content from a line delta. This doesn't return all of the extra information stored in a delta. Only the actual content lines. 
""" lines = iter(lines) for header in lines: header = header.split(b",") count = int(header[2]) for _ in range(count): _origin, text = next(lines).split(b" ", 1) yield text def lower_fulltext(self, content): """Convert a fulltext content record into a serializable form. see parse_fulltext which this inverts. """ return [b"%s %s" % (o, t) for o, t in content._lines] def lower_line_delta(self, delta): """Convert a delta into a serializable form. See parse_line_delta which this inverts. """ # TODO: jam 20070209 We only do the caching thing to make sure that # the origin is a valid utf-8 line, eventually we could remove it out = [] for start, end, c, lines in delta: out.append(b"%d,%d,%d\n" % (start, end, c)) out.extend(origin + b" " + text for origin, text in lines) return out def annotate(self, knit, key): """Get annotated lines for a given key. Args: knit: The knit storage to read from. key: The version key to annotate. Returns: A list of (origin, text) tuples for each line. """ content = knit._get_content(key) # adjust for the fact that serialised annotations are only key suffixes # for this factory. if isinstance(key, tuple): prefix = key[:-1] origins = content.annotate() result = [] for origin, line in origins: result.append((prefix + (origin,), line)) return result else: # XXX: This smells a bit. Why would key ever be a non-tuple here? # Aren't keys defined to be tuples? -- spiv 20080618 return content.annotate() class KnitPlainFactory(_KnitFactory): """Factory for creating plain Content objects.""" annotated = False def make(self, lines, version_id): """Create a PlainKnitContent from lines and version_id. Args: lines: The text lines. version_id: The version identifier. Returns: A PlainKnitContent instance. """ return PlainKnitContent(lines, version_id) def parse_fulltext(self, content, version_id): """This parses an unannotated fulltext. Note that this is not a noop - the internal representation has (versionid, line) - its just a constant versionid. 
""" return self.make(content, version_id) def parse_line_delta_iter(self, lines, version_id): """Parse line delta records into an iterator of delta operations. Args: lines: The delta lines to parse. version_id: The version identifier (unused for plain content). Yields: Tuples of (start, end, count, lines) for each delta operation. """ cur = 0 num_lines = len(lines) while cur < num_lines: header = lines[cur] cur += 1 start, end, c = (int(n) for n in header.split(b",")) yield start, end, c, lines[cur : cur + c] cur += c def parse_line_delta(self, lines, version_id): """Parse line delta records into a list of delta operations. Args: lines: The delta lines to parse. version_id: The version identifier (unused for plain content). Returns: A list of (start, end, count, lines) tuples. """ return list(self.parse_line_delta_iter(lines, version_id)) def get_fulltext_content(self, lines): """Extract just the content lines from a fulltext.""" return iter(lines) def get_linedelta_content(self, lines): """Extract just the content from a line delta. This doesn't return all of the extra information stored in a delta. Only the actual content lines. """ lines = iter(lines) for header in lines: header = header.split(b",") count = int(header[2]) for _ in range(count): yield next(lines) def lower_fulltext(self, content): """Convert a fulltext content record into a serializable form. Args: content: The PlainKnitContent to serialize. Returns: The text lines. """ return content.text() def lower_line_delta(self, delta): """Convert a delta into a serializable form. Args: delta: The delta to serialize. Returns: A list of serialized delta lines. """ out = [] for start, end, c, lines in delta: out.append(b"%d,%d,%d\n" % (start, end, c)) out.extend(lines) return out def annotate(self, knit, key): """Get annotated lines for a given key using a KnitAnnotator. Args: knit: The knit storage to read from. key: The version key to annotate. Returns: A list of (origin, text) tuples for each line. 
""" annotator = _KnitAnnotator(knit) return annotator.annotate_flat(key) def make_file_factory(annotated, mapper): """Create a factory for creating a file based KnitVersionedFiles. This is only functional enough to run interface tests, it doesn't try to provide a full pack environment. :param annotated: knit annotations are wanted. :param mapper: The mapper from keys to paths. """ def factory(transport): index = _KndxIndex(transport, mapper, lambda: None, lambda: True, lambda: True) access = _KnitKeyAccess(transport, mapper) return KnitVersionedFiles(index, access, annotated=annotated) return factory def make_pack_factory(graph, delta, keylength): """Create a factory for creating a pack based VersionedFiles. This is only functional enough to run interface tests, it doesn't try to provide a full pack environment. :param graph: Store a graph. :param delta: Delta compress contents. :param keylength: How long should keys be. """ def factory(transport): parents = graph or delta ref_length = 0 if graph: ref_length += 1 if delta: ref_length += 1 max_delta_chain = 200 else: max_delta_chain = 0 graph_index = _mod_index.InMemoryGraphIndex( reference_lists=ref_length, key_elements=keylength ) stream = transport.open_write_stream("newpack") writer = pack.ContainerWriter(stream.write) writer.begin() index = _KnitGraphIndex( graph_index, lambda: True, parents=parents, deltas=delta, add_callback=graph_index.add_nodes, ) access = pack_repo._DirectPackAccess({}) access.set_writer(writer, graph_index, (transport, "newpack")) result = KnitVersionedFiles(index, access, max_delta_chain=max_delta_chain) result.stream = stream result.writer = writer return result return factory def cleanup_pack_knit(versioned_files): """Clean up resources used by a pack knit versioned files instance. Args: versioned_files: The KnitVersionedFiles instance to clean up. 
""" versioned_files.stream.close() versioned_files.writer.end() def _get_total_build_size(self, keys, positions): """Determine the total bytes to build these keys. (helper function because _KnitGraphIndex and _KndxIndex work the same, but don't inherit from a common base.) :param keys: Keys that we want to build :param positions: dict of {key, (info, index_memo, comp_parent)} (such as returned by _get_components_positions) :return: Number of bytes to build those keys """ all_build_index_memos = {} build_keys = keys while build_keys: next_keys = set() for key in build_keys: # This is mostly for the 'stacked' case # Where we will be getting the data from a fallback if key not in positions: continue _, index_memo, compression_parent = positions[key] all_build_index_memos[key] = index_memo if compression_parent not in all_build_index_memos: next_keys.add(compression_parent) build_keys = next_keys return sum(index_memo[2] for index_memo in all_build_index_memos.values()) class KnitVersionedFiles(VersionedFilesWithFallbacks): """Storage for many versioned files using knit compression. Backend storage is managed by indices and data objects. :ivar _index: A _KnitGraphIndex or similar that can describe the parents, graph, compression and data location of entries in this KnitVersionedFiles. Note that this is only the index for *this* vfs; if there are fallbacks they must be queried separately. """ def __init__( self, index, data_access, max_delta_chain=200, annotated=False, reload_func=None ): """Create a KnitVersionedFiles with index and data_access. :param index: The index for the knit data. :param data_access: The access object to store and retrieve knit records. :param max_delta_chain: The maximum number of deltas to permit during insertion. Set to 0 to prohibit the use of deltas. :param annotated: Set to True to cause annotations to be calculated and stored during insertion. 
:param reload_func: A function that can be called if we think we need to reload the pack listing and try again. See 'bzrformats.pack_repo.AggregateIndex' for the signature. """ self._index = index self._access = data_access self._max_delta_chain = max_delta_chain if annotated: self._factory = KnitAnnotateFactory() else: self._factory = KnitPlainFactory() self._immediate_fallback_vfs = [] self._reload_func = reload_func def __repr__(self): """Return a string representation of this KnitVersionedFiles. Returns: A string showing the class name, index, and access objects. """ return f"{self.__class__.__name__}({self._index!r}, {self._access!r})" def without_fallbacks(self): """Return a clone of this object without any fallbacks configured.""" return KnitVersionedFiles( self._index, self._access, self._max_delta_chain, self._factory.annotated, self._reload_func, ) def add_fallback_versioned_files(self, a_versioned_files): """Add a source of texts for texts not present in this knit. :param a_versioned_files: A VersionedFiles object. """ self._immediate_fallback_vfs.append(a_versioned_files) def add_lines( self, key, parents, lines, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, check_content=True, ): """See VersionedFiles.add_lines().""" self._index._check_write_ok() self._check_add(key, lines, random_id, check_content) if parents is None: # The caller might pass None if there is no graph data, but kndx # indexes can't directly store that, so we give them # an empty tuple instead.
parents = () line_bytes = b"".join(lines) return self._add( key, lines, parents, parent_texts, left_matching_blocks, nostore_sha, random_id, line_bytes=line_bytes, ) def add_content( self, content_factory, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, ): """See VersionedFiles.add_content().""" self._index._check_write_ok() key = content_factory.key parents = content_factory.parents self._check_add(key, None, random_id, check_content=False) if parents is None: # The caller might pass None if there is no graph data, but kndx # indexes can't directly store that, so we give them # an empty tuple instead. parents = () lines = content_factory.get_bytes_as("lines") line_bytes = content_factory.get_bytes_as("fulltext") return self._add( key, lines, parents, parent_texts, left_matching_blocks, nostore_sha, random_id, line_bytes=line_bytes, ) def _add( self, key, lines, parents, parent_texts, left_matching_blocks, nostore_sha, random_id, line_bytes, ): """Add a set of lines on top of the version specified by parents. Any versions not present will be converted into ghosts. :param lines: A list of strings where each one is a single line (has a single newline at the end of the string) This is now optional (callers can pass None). It is left in its location for backwards compatibility. When provided, ''.join(lines) must == line_bytes :param line_bytes: A single string containing the content We pass both lines and line_bytes because different routes bring the values to this function. And for memory efficiency, we don't want to have to split/join on-demand. """ # first thing, if the content is something we don't need to store, find # that out. digest = sha_string(line_bytes) if nostore_sha == digest: raise ExistingContent present_parents = [] if parent_texts is None: parent_texts = {} # Do a single query to ascertain parent presence; we only compress # against parents in the same kvf.
present_parent_map = self._index.get_parent_map(parents) for parent in parents: if parent in present_parent_map: present_parents.append(parent) # Currently we can only compress against the leftmost present parent. if len(present_parents) == 0 or present_parents[0] != parents[0]: delta = False else: # To speed up the extraction of texts the delta chain is limited # to a fixed number of deltas. This should minimize both # I/O and the time spent applying deltas. delta = self._check_should_delta(present_parents[0]) text_length = len(line_bytes) options = [] no_eol = False # Note: line_bytes is not modified to add a newline, that is tracked # via the no_eol flag. 'lines' *is* modified, because that is the # general values needed by the Content code. if line_bytes and not line_bytes.endswith(b"\n"): options.append(b"no-eol") no_eol = True # Copy the existing list, or create a new one lines = osutils.split_lines(line_bytes) if lines is None else lines[:] # Replace the last line with one that ends in a final newline lines[-1] = lines[-1] + b"\n" if lines is None: lines = osutils.split_lines(line_bytes) for element in key[:-1]: if not isinstance(element, bytes): raise TypeError(f"key contains non-bytestrings: {key!r}") if key[-1] is None: key = key[:-1] + (b"sha1:" + digest,) elif not isinstance(key[-1], bytes): raise TypeError(f"key contains non-bytestrings: {key!r}") # Knit hunks are still last-element only version_id = key[-1] content = self._factory.make(lines, version_id) if no_eol: # Hint to the content object that its text() call should strip the # EOL. content._should_strip_eol = True if delta or (self._factory.annotated and len(present_parents) > 0): # Merge annotations from parent texts if needed.
delta_hunks = self._merge_annotations( content, present_parents, parent_texts, delta, self._factory.annotated, left_matching_blocks, ) if delta: options.append(b"line-delta") store_lines = self._factory.lower_line_delta(delta_hunks) size, data = self._record_to_data(key, digest, store_lines) else: options.append(b"fulltext") # isinstance is slower and we have no hierarchy. if self._factory.__class__ is KnitPlainFactory: # Use the already joined bytes saving iteration time in # _record_to_data. dense_lines = [line_bytes] if no_eol: dense_lines.append(b"\n") size, data = self._record_to_data(key, digest, lines, dense_lines) else: # get mixed annotation + content and feed it into the # serialiser. store_lines = self._factory.lower_fulltext(content) size, data = self._record_to_data(key, digest, store_lines) access_memo = self._access.add_raw_record(key, size, data) self._index.add_records( ((key, options, access_memo, parents),), random_id=random_id ) return digest, text_length, content def annotate(self, key): """See VersionedFiles.annotate.""" return self._factory.annotate(self, key) def get_annotator(self): """Get an annotator for this knit. Returns: A _KnitAnnotator instance for annotating content. """ return _KnitAnnotator(self) def check(self, progress_bar=None, keys=None): """See VersionedFiles.check().""" if keys is None: return self._logical_check() else: # At the moment, check does no extra work beyond get_record_stream return self.get_record_stream(keys, "unordered", True) def _logical_check(self): # This doesn't actually test extraction of everything, but that will # impact 'bzr check' substantially, and needs to be integrated with # care. However, it does check for the obvious problem of a delta with # no basis.
keys = self._index.keys() parent_map = self.get_parent_map(keys) for key in keys: if self._index.get_method(key) != "fulltext": compression_parent = parent_map[key][0] if compression_parent not in parent_map: raise KnitCorrupt( self, f"Missing basis parent {compression_parent} for {key}", ) for fallback_vfs in self._immediate_fallback_vfs: fallback_vfs.check() def _check_add(self, key, lines, random_id, check_content): """Check that version_id and lines are safe to add.""" if not all(isinstance(x, bytes) or x is None for x in key): raise TypeError(key) version_id = key[-1] if version_id is not None: if contains_whitespace(version_id): raise InvalidRevisionId(version_id, self) self.check_not_reserved_id(version_id) # TODO: If random_id==False and the key is already present, we should # probably check that the existing content is identical to what is # being inserted, and otherwise raise an exception. This would make # the bundle code simpler. if check_content: self._check_lines_not_unicode(lines) self._check_lines_are_lines(lines) def _check_header(self, key, line): rec = self._split_header(line) self._check_header_version(rec, key[-1]) return rec def _check_header_version(self, rec, version_id): """Checks the header version on original format knit records. These have the last component of the key embedded in the record. """ if rec[1] != version_id: raise KnitCorrupt( self, f"unexpected version, wanted {version_id!r}, got {rec[1]!r}" ) def _check_should_delta(self, parent): """Iterate back through the parent listing, looking for a fulltext. This is used when we want to decide whether to add a delta or a new fulltext. It searches for _max_delta_chain parents. When it finds a fulltext parent, it sees if the total size of the deltas leading up to it is large enough to indicate that we want a new full text anyway. Return True if we should create a new delta, False if we should use a full text. 
""" delta_size = 0 fulltext_size = None for _count in range(self._max_delta_chain): try: # Note that this only looks in the index of this particular # KnitVersionedFiles, not in the fallbacks. This ensures that # we won't store a delta spanning physical repository # boundaries. build_details = self._index.get_build_details([parent]) parent_details = build_details[parent] except (RevisionNotPresent, KeyError): # Some basis is not locally present: always fulltext return False index_memo, compression_parent, _, _ = parent_details _, _, size = index_memo if compression_parent is None: fulltext_size = size break delta_size += size # We don't explicitly check for presence because this is in an # inner loop, and if it's missing it'll fail anyhow. parent = compression_parent else: # We couldn't find a fulltext, so we must create a new one return False # Simple heuristic - if the total I/O would be greater as a delta than # the originally installed fulltext, we create a new fulltext. return fulltext_size > delta_size def _build_details_to_components(self, build_details): """Convert a build_details tuple to a position tuple.""" # record_details, access_memo, compression_parent return build_details[3], build_details[0], build_details[1] def _get_components_positions(self, keys, allow_missing=False): """Produce a map of position data for the components of keys. This data is intended to be used for retrieving the knit records. A dict of key to (record_details, index_memo, next, parents) is returned. * method is the way referenced data should be applied. * index_memo is the handle to pass to the data access to actually get the data * next is the build-parent of the version, or None for fulltexts. * parents is the version_ids of the parents of this version :param allow_missing: If True do not raise an error on a missing component, just ignore it.
""" component_data = {} pending_components = keys while pending_components: build_details = self._index.get_build_details(pending_components) current_components = set(pending_components) pending_components = set() for key, details in build_details.items(): (_index_memo, compression_parent, _parents, _record_details) = details if compression_parent is not None: pending_components.add(compression_parent) component_data[key] = self._build_details_to_components(details) missing = current_components.difference(build_details) if missing and not allow_missing: raise RevisionNotPresent(missing.pop(), self) return component_data def _get_content(self, key, parent_texts=None): """Returns a content object that makes up the specified version. """ if parent_texts is None: parent_texts = {} cached_version = parent_texts.get(key) if cached_version is not None: # Ensure the cache dict is valid. if not self.get_parent_map([key]): raise RevisionNotPresent(key, self) return cached_version generator = _VFContentMapGenerator(self, [key]) return generator._get_content(key) def get_parent_map(self, keys): """Get a map of the graph parents of keys. :param keys: The keys to look up parents for. :return: A mapping from keys to parents. Absent keys are absent from the mapping. """ return self._get_parent_map_with_sources(keys)[0] def _get_parent_map_with_sources(self, keys): """Get a map of the parents of keys. :param keys: The keys to look up parents for. :return: A tuple. The first element is a mapping from keys to parents. Absent keys are absent from the mapping. The second element is a list with the locations each key was found in. The first element is the in-this-knit parents, the second the first fallback source, and so on. 
""" result = {} sources = [self._index] + self._immediate_fallback_vfs source_results = [] missing = set(keys) for source in sources: if not missing: break new_result = source.get_parent_map(missing) source_results.append(new_result) result.update(new_result) missing.difference_update(set(new_result)) return result, source_results def _get_record_map(self, keys, allow_missing=False): """Produce a dictionary of knit records. :return: {key:(record, record_details, digest, next)} * record: data returned from read_records (a KnitContent object) * record_details: opaque information to pass to parse_record * digest: SHA1 digest of the full text after all steps are done * next: build-parent of the version, i.e. the leftmost ancestor. Will be None if the record is not a delta. :param keys: The keys to build a map for :param allow_missing: If some records are missing, rather than error, just return the data that could be generated. """ raw_map = self._get_record_map_unparsed(keys, allow_missing=allow_missing) return self._raw_map_to_record_map(raw_map) def _raw_map_to_record_map(self, raw_map): """Parse the contents of _get_record_map_unparsed. :return: see _get_record_map. """ result = {} for key in raw_map: data, record_details, next = raw_map[key] content, digest = self._parse_record(key[-1], data) result[key] = content, record_details, digest, next return result def _get_record_map_unparsed(self, keys, allow_missing=False): """Get the raw data for reconstructing keys without parsing it. :return: A dict suitable for parsing via _raw_map_to_record_map. key-> raw_bytes, (method, noeol), compression_parent """ # This retries the whole request if anything fails. Potentially we # could be a bit more selective. We could track the keys whose records # we have successfully found, and then only request the new records # from there. However, _get_components_positions grabs the whole build # chain, which means we'll likely try to grab the same records again # anyway.
Also, can the build chains change as part of a pack # operation? We wouldn't want to end up with a broken chain. while True: try: position_map = self._get_components_positions( keys, allow_missing=allow_missing ) # key = component_id, r = record_details, i_m = index_memo, # n = next records = [(key, i_m) for key, (r, i_m, n) in position_map.items()] # Sort by the index memo, so that we request records from the # same pack file together, and in forward-sorted order records.sort(key=operator.itemgetter(1)) raw_record_map = {} for key, data in self._read_records_iter_unchecked(records): (record_details, _index_memo, next) = position_map[key] raw_record_map[key] = data, record_details, next return raw_record_map except pack_repo.RetryWithNewPacks as e: self._access.reload_or_raise(e) @classmethod def _split_by_prefix(cls, keys): """For the given keys, split them up based on their prefix. To keep memory pressure somewhat under control, split the requests back into per-file-id requests, otherwise "bzr co" extracts the full tree into memory before writing it to disk. This should be revisited if _get_content_maps() can ever cross file-id boundaries. The keys for a given file_id are kept in the same relative order. Ordering between file_ids is not, though prefix_order will return the order that the key was first seen. :param keys: An iterable of key tuples :return: (split_map, prefix_order) split_map A dictionary mapping prefix => keys prefix_order The order that we saw the various prefixes """ split_by_prefix = {} prefix_order = [] for key in keys: prefix = b"" if len(key) == 1 else key[0] if prefix in split_by_prefix: split_by_prefix[prefix].append(key) else: split_by_prefix[prefix] = [key] prefix_order.append(prefix) return split_by_prefix, prefix_order def _group_keys_for_io( self, keys, non_local_keys, positions, _min_buffer_size=_STREAM_MIN_BUFFER_SIZE ): """For the given keys, group them into 'best-sized' requests. 
The idea is to avoid making 1 request per file, but to never try to unpack an entire 1.5GB source tree in a single pass. Also when possible, we should try to group requests to the same pack file together. :return: list of (keys, non_local) tuples that indicate what keys should be fetched next. """ # TODO: Ideally we would group on 2 factors. We want to extract texts # from the same pack file together, and we want to extract all # the texts for a given build-chain together. Ultimately it # probably needs a better global view. len(keys) prefix_split_keys, prefix_order = self._split_by_prefix(keys) prefix_split_non_local_keys, _ = self._split_by_prefix(non_local_keys) cur_keys = [] cur_non_local = set() cur_size = 0 result = [] sizes = [] for prefix in prefix_order: keys = prefix_split_keys[prefix] non_local = prefix_split_non_local_keys.get(prefix, []) this_size = self._index._get_total_build_size(keys, positions) cur_size += this_size cur_keys.extend(keys) cur_non_local.update(non_local) if cur_size > _min_buffer_size: result.append((cur_keys, cur_non_local)) sizes.append(cur_size) cur_keys = [] cur_non_local = set() cur_size = 0 if cur_keys: result.append((cur_keys, cur_non_local)) sizes.append(cur_size) return result def get_record_stream(self, keys, ordering, include_delta_closure): """Get a stream of records for keys. :param keys: The keys to include. :param ordering: Either 'unordered' or 'topological'. A topologically sorted stream has compression parents strictly before their children. :param include_delta_closure: If True then the closure across any compression parents will be included (in the opaque data). :return: An iterator of ContentFactory objects, each of which is only valid until the iterator is advanced. """ # keys might be a generator keys = set(keys) if not keys: return if not self._index.has_graph: # Cannot sort when no graph has been stored. 
ordering = "unordered" remaining_keys = keys while True: try: keys = set(remaining_keys) for content_factory in self._get_remaining_record_stream( keys, ordering, include_delta_closure ): remaining_keys.discard(content_factory.key) yield content_factory return except pack_repo.RetryWithNewPacks as e: self._access.reload_or_raise(e) def _get_remaining_record_stream(self, keys, ordering, include_delta_closure): """This function is the 'retry' portion for get_record_stream.""" if include_delta_closure: positions = self._get_components_positions(keys, allow_missing=True) else: build_details = self._index.get_build_details(keys) # map from key to # (record_details, access_memo, compression_parent_key) positions = { key: self._build_details_to_components(details) for key, details in build_details.items() } absent_keys = keys.difference(set(positions)) # There may be more absent keys : if we're missing the basis component # and are trying to include the delta closure. # XXX: We should not ever need to examine remote sources because we do # not permit deltas across versioned files boundaries. if include_delta_closure: needed_from_fallback = set() # Build up reconstructable_keys dict. key:True in this dict means # the key can be reconstructed. reconstructable_keys = {} for key in keys: # the delta chain try: chain = [key, positions[key][2]] except KeyError: needed_from_fallback.add(key) continue result = True while chain[-1] is not None: if chain[-1] in reconstructable_keys: result = reconstructable_keys[chain[-1]] break else: try: chain.append(positions[chain[-1]][2]) except KeyError: # missing basis component needed_from_fallback.add(chain[-1]) result = True break for chain_key in chain[:-1]: reconstructable_keys[chain_key] = result if not result: needed_from_fallback.add(key) # Double index lookups here : need a unified api ? 
global_map, parent_maps = self._get_parent_map_with_sources(keys) if ordering in ("topological", "groupcompress"): if ordering == "topological": # Global topological sort present_keys = tsort.topo_sort(global_map) else: present_keys = sort_groupcompress(global_map) # Now group by source: source_keys = [] current_source = None for key in present_keys: for parent_map in parent_maps: if key in parent_map: key_source = parent_map break if current_source is not key_source: source_keys.append((key_source, [])) current_source = key_source source_keys[-1][1].append(key) else: if ordering != "unordered": raise AssertionError( "valid values for ordering are:" f' "unordered", "groupcompress" or "topological" not: {ordering!r}' ) # Just group by source; remote sources first. present_keys = [] source_keys = [] for parent_map in reversed(parent_maps): source_keys.append((parent_map, [])) for key in parent_map: present_keys.append(key) source_keys[-1][1].append(key) # We have been requested to return these records in an order that # suits us. So we ask the index to give us an optimally sorted # order. for source, sub_keys in source_keys: if source is parent_maps[0]: # Only sort the keys for this VF self._index._sort_keys_by_io(sub_keys, positions) absent_keys = keys - set(global_map) for key in absent_keys: yield AbsentContentFactory(key) # restrict our view to the keys we can answer. # XXX: Memory: TODO: batch data here to cap buffered data at (say) 1MB. # XXX: At that point we need to consider the impact of double reads by # utilising components multiple times. if include_delta_closure: # XXX: get_content_maps performs its own index queries; allow state # to be passed in. 
non_local_keys = needed_from_fallback - absent_keys for keys, non_local_keys in self._group_keys_for_io( # noqa: B020 present_keys, non_local_keys, positions, ): generator = _VFContentMapGenerator( self, keys, non_local_keys, global_map, ordering=ordering ) yield from generator.get_record_stream() else: for source, keys in source_keys: if source is parent_maps[0]: # this KnitVersionedFiles records = [(key, positions[key][1]) for key in keys] for key, raw_data in self._read_records_iter_unchecked(records): (record_details, _index_memo, _) = positions[key] yield KnitContentFactory( key, global_map[key], record_details, None, raw_data, self._factory.annotated, None, ) else: vf = self._immediate_fallback_vfs[parent_maps.index(source) - 1] yield from vf.get_record_stream( keys, ordering, include_delta_closure ) def get_sha1s(self, keys): """See VersionedFiles.get_sha1s().""" missing = set(keys) record_map = self._get_record_map(missing, allow_missing=True) result = {} for key, details in record_map.items(): if key not in missing: continue # record entry 2 is the 'digest'. result[key] = details[2] missing.difference_update(set(result)) for source in self._immediate_fallback_vfs: if not missing: break new_result = source.get_sha1s(missing) result.update(new_result) missing.difference_update(set(new_result)) return result def insert_record_stream(self, stream): """Insert a record stream into this container. :param stream: A stream of records to insert. :return: None :seealso VersionedFiles.get_record_stream: """ def get_adapter(adapter_key): try: return adapters[adapter_key] except KeyError: adapter_factory = adapter_registry.get(adapter_key) adapter = adapter_factory(self) adapters[adapter_key] = adapter return adapter delta_types = set() if self._factory.annotated: # self is annotated, we need annotated knits to use directly. annotated = "annotated-" convertibles = [] else: # self is not annotated, but we can strip annotations cheaply. 
annotated = "" convertibles = {"knit-annotated-ft-gz"} if self._max_delta_chain: delta_types.add("knit-annotated-delta-gz") convertibles.add("knit-annotated-delta-gz") # The set of types we can cheaply adapt without needing basis texts. native_types = set() if self._max_delta_chain: native_types.add(f"knit-{annotated}delta-gz") delta_types.add(f"knit-{annotated}delta-gz") native_types.add(f"knit-{annotated}ft-gz") knit_types = native_types.union(convertibles) adapters = {} # Buffer all index entries that we can't add immediately because their # basis parent is missing. We don't buffer all because generating # annotations may require access to some of the new records. However we # can't generate annotations from new deltas until their basis parent # is present anyway, so we get away with not needing an index that # includes the new keys. # # See about ordering of compression # parents in the records - to be conservative, we insist that all # parents must be present to avoid expanding to a fulltext. # # key = basis_parent, value = index entry to add buffered_index_entries = {} for record in stream: kind = record.storage_kind if kind.startswith("knit-") and kind.endswith("-gz"): # Check that the ID in the header of the raw knit bytes matches # the record metadata. raw_data = record._raw_record df, _rec = self._parse_record_header(record.key, raw_data) df.close() buffered = False parents = record.parents if record.storage_kind in delta_types: # TODO: eventually the record itself should track # compression_parent compression_parent = parents[0] else: compression_parent = None # Raise an error when a record is missing. 
if record.storage_kind == "absent": raise RevisionNotPresent([record.key], self) elif (record.storage_kind in knit_types) and ( compression_parent is None or not self._immediate_fallback_vfs or compression_parent in self._index or compression_parent not in self ): # we can insert the knit record literally if either it has no # compression parent OR we already have its basis in this kvf # OR the basis is not present even in the fallbacks. In the # last case it will either turn up later in the stream and all # will be well, or it won't turn up at all and we'll raise an # error at the end. # # TODO: self.__contains__ is somewhat redundant with # self._index.__contains__; we really want something that directly # asks if it's only present in the fallbacks. -- mbp 20081119 if record.storage_kind not in native_types: try: adapter_key = (record.storage_kind, "knit-delta-gz") adapter = get_adapter(adapter_key) except KeyError: adapter_key = (record.storage_kind, "knit-ft-gz") adapter = get_adapter(adapter_key) bytes = adapter.get_bytes(record, adapter_key[1]) else: # It's a knit record, it has a _raw_record field (even if # it was reconstituted from a network stream). bytes = record._raw_record options = [record._build_details[0].encode("ascii")] if record._build_details[1]: options.append(b"no-eol") # Just blat it across. # Note: This does end up adding data on duplicate keys. As # modern repositories use atomic insertions this should not # lead to excessive growth in the event of interrupted fetches. # 'knit' repositories may suffer excessive growth, but as a # deprecated format this is tolerable. It can be fixed if # needed by having the kndx index support raising on a duplicate # add with identical parents and options. access_memo = self._access.add_raw_record( record.key, len(bytes), [bytes] ) index_entry = (record.key, options, access_memo, parents) if b"fulltext" not in options: # Not a fulltext, so we need to make sure the compression # parent will also be present.
# Note that pack backed knits don't need to buffer here # because they buffer all writes to the transaction level, # but we don't expose that difference at the index level. If # the query here has sufficient cost to show up in # profiling we should do that. # # They're required to be physically in this # KnitVersionedFiles, not in a fallback. if compression_parent not in self._index: pending = buffered_index_entries.setdefault( compression_parent, [] ) pending.append(index_entry) buffered = True if not buffered: self._index.add_records([index_entry]) elif record.storage_kind in ("chunked", "file"): self.add_lines(record.key, parents, record.get_bytes_as("lines")) else: # Not suitable for direct insertion as a # delta, either because it's not the right format, or this # KnitVersionedFiles doesn't permit deltas (_max_delta_chain == # 0) or because it depends on a base only present in the # fallback kvfs. self._access.flush() try: # Try getting a fulltext directly from the record. lines = record.get_bytes_as("lines") except UnavailableRepresentation: adapter_key = record.storage_kind, "lines" adapter = get_adapter(adapter_key) lines = adapter.get_bytes(record, "lines") with contextlib.suppress(RevisionAlreadyPresent): self.add_lines(record.key, parents, lines) # Add any records whose basis parent is now available. 
if not buffered: added_keys = [record.key] while added_keys: key = added_keys.pop(0) if key in buffered_index_entries: index_entries = buffered_index_entries[key] self._index.add_records(index_entries) added_keys.extend( [index_entry[0] for index_entry in index_entries] ) del buffered_index_entries[key] if buffered_index_entries: # There were index entries buffered at the end of the stream, # So these need to be added (if the index supports holding such # entries for later insertion) all_entries = [] for key in buffered_index_entries: index_entries = buffered_index_entries[key] all_entries.extend(index_entries) self._index.add_records(all_entries, missing_compression_parents=True) def get_missing_compression_parent_keys(self): """Return an iterable of keys of missing compression parents. Check this after calling insert_record_stream to find out if there are any missing compression parents. If there are, the records that depend on them are not able to be inserted safely. For atomic KnitVersionedFiles built on packs, the transaction should be aborted or suspended - commit will fail at this point. Nonatomic knits will error earlier because they have no staging area to put pending entries into. """ return self._index.get_missing_compression_parents() def iter_lines_added_or_present_in_keys(self, keys, pb=None): r"""Iterate over the lines in the versioned files from keys. This may return lines from other keys. Each item the returned iterator yields is a tuple of a line and a text version that that line is present in (not introduced in). Ordering of results is in whatever order is most suitable for the underlying storage format. If a progress bar is supplied, it may be used to indicate progress. The caller is responsible for cleaning up progress bars (because this is an iterator). Notes: * Lines are normalised by the underlying store: they will all have \n terminators. * Lines are returned in arbitrary order. 
* If a requested key did not change any lines (or didn't have any lines), it may not be mentioned at all in the result. :param pb: Progress bar supplied by caller. :return: An iterator over (line, key). """ if pb is None: class _NullProgressBar: def update(self, *args): pass def finished(self): pass pb = _NullProgressBar() keys = set(keys) total = len(keys) done = False while not done: try: # we don't care about inclusions, the caller cares. # but we need to setup a list of records to visit. # we need key, position, length key_records = [] build_details = self._index.get_build_details(keys) for key, details in build_details.items(): if key in keys: key_records.append((key, details[0])) records_iter = enumerate(self._read_records_iter(key_records)) for key_idx, (key, data, _sha_value) in records_iter: pb.update("Walking content", key_idx, total) compression_parent = build_details[key][1] if compression_parent is None: # fulltext line_iterator = self._factory.get_fulltext_content(data) else: # Delta line_iterator = self._factory.get_linedelta_content(data) # Now that we are yielding the data for this key, remove it # from the list keys.remove(key) # XXX: It might be more efficient to yield (key, # line_iterator) in the future. However for now, this is a # simpler change to integrate into the rest of the # codebase. RBC 20071110 for line in line_iterator: yield line, key done = True except pack_repo.RetryWithNewPacks as e: self._access.reload_or_raise(e) # If there are still keys we've not yet found, we look in the fallback # vfs, and hope to find them there. Note that if the keys are found # but had no changes or no content, the fallback may not return # anything. if keys and not self._immediate_fallback_vfs: # XXX: strictly the second parameter is meant to be the file id # but it's not easily accessible here. 
raise RevisionNotPresent(keys, repr(self)) for source in self._immediate_fallback_vfs: if not keys: break source_keys = set() for line, key in source.iter_lines_added_or_present_in_keys(keys): source_keys.add(key) yield line, key keys.difference_update(source_keys) pb.update("Walking content", total, total) def _make_line_delta(self, delta_seq, new_content): """Generate a line delta from delta_seq and new_content.""" diff_hunks = [] for op in delta_seq.get_opcodes(): if op[0] == "equal": continue diff_hunks.append( (op[1], op[2], op[4] - op[3], new_content._lines[op[3] : op[4]]) ) return diff_hunks def _merge_annotations( self, content, parents, parent_texts=None, delta=None, annotated=None, left_matching_blocks=None, ): """Merge annotations for content and generate deltas. This is done by comparing the annotations based on changes to the text and generating a delta on the resulting full texts. If annotations are not being created then a simple delta is created. """ if parent_texts is None: parent_texts = {} import patiencediff if left_matching_blocks is not None: delta_seq = diff._PrematchedMatcher(left_matching_blocks) else: delta_seq = None if annotated: for parent_key in parents: merge_content = self._get_content(parent_key, parent_texts) if parent_key == parents[0] and delta_seq is not None: seq = delta_seq else: seq = patiencediff.PatienceSequenceMatcher( None, merge_content.text(), content.text() ) for i, j, n in seq.get_matching_blocks(): if n == 0: continue # this copies (origin, text) pairs across to the new # content for any line that matches the last-checked # parent. content._lines[j : j + n] = merge_content._lines[i : i + n] # XXX: Robert says the following block is a workaround for a # now-fixed bug and it can probably be deleted. -- mbp 20080618 if content._lines and not content._lines[-1][1].endswith(b"\n"): # The copied annotation was from a line without a trailing EOL, # reinstate one for the content object, to ensure correct # serialization. 
line = content._lines[-1][1] + b"\n" content._lines[-1] = (content._lines[-1][0], line) if delta: if delta_seq is None: reference_content = self._get_content(parents[0], parent_texts) new_texts = content.text() old_texts = reference_content.text() delta_seq = patiencediff.PatienceSequenceMatcher( None, old_texts, new_texts ) return self._make_line_delta(delta_seq, content) def _parse_record(self, version_id, data): """Parse an original format knit record. These have the last element of the key only present in the stored data. """ rec, record_contents = self._parse_record_unchecked(data) self._check_header_version(rec, version_id) return record_contents, rec[3] def _parse_record_header(self, key, raw_data): """Parse a record header for consistency. :return: the header and the decompressor stream. as (stream, header_record) """ df = gzip.GzipFile(mode="rb", fileobj=BytesIO(raw_data)) try: # Current serialise rec = self._check_header(key, df.readline()) except Exception as e: raise KnitCorrupt( self, f"While reading {{{key}}} got {e.__class__.__name__}({e!s})" ) from e return df, rec def _parse_record_unchecked(self, data): # profiling notes: # 4168 calls in 2880 217 internal # 4168 calls to _parse_record_header in 2121 # 4168 calls to readlines in 330 with gzip.GzipFile(mode="rb", fileobj=BytesIO(data)) as df: try: record_contents = df.readlines() except Exception as e: raise KnitCorrupt( self, f"Corrupt compressed record {data!r}, got {e.__class__.__name__}({e!s})", ) from e header = record_contents.pop(0) rec = self._split_header(header) last_line = record_contents.pop() if len(record_contents) != int(rec[2]): raise KnitCorrupt( self, f"incorrect number of lines {len(record_contents)} != {int(rec[2])}" f" for version {{{rec[1]}}} {record_contents}", ) if last_line != b"end %s\n" % rec[1]: raise KnitCorrupt( self, f"unexpected version end line {last_line!r}, wanted {rec[1]!r}", ) return rec, record_contents def _read_records_iter(self, records): """Read text records 
from data file and yield result. Results are returned in whatever order is fastest to read, not in the order requested. Also, multiple requests for the same record will only yield 1 response. :param records: A list of (key, access_memo) entries :return: Yields (key, contents, digest) in the order read, not the order requested """ if not records: return # XXX: This smells wrong, IO may not be getting ordered right. needed_records = sorted(set(records), key=operator.itemgetter(1)) if not needed_records: return # The transport optimizes the fetching as well # (i.e. it reads contiguous ranges.) raw_data = self._access.get_raw_records( [index_memo for key, index_memo in needed_records] ) for (key, _index_memo), data in zip(needed_records, raw_data, strict=False): content, digest = self._parse_record(key[-1], data) yield key, content, digest def _read_records_iter_raw(self, records): """Read text records from data file and yield raw data. This unpacks enough of the text record to validate the id is as expected but that's all. Each item the iterator yields is (key, bytes, expected_sha1_of_full_text). """ for key, data in self._read_records_iter_unchecked(records): # validate the header (note that we can only use the suffix in # current knit records). df, rec = self._parse_record_header(key, data) df.close() yield key, data, rec[3] def _read_records_iter_unchecked(self, records): """Read text records from data file and yield raw data. No validation is done. Yields tuples of (key, data). """ # setup an iterator of the external records: # uses readv so nice and fast we hope. if len(records): # grab the disk data needed. needed_offsets = [index_memo for key, index_memo in records] raw_records = self._access.get_raw_records(needed_offsets) for key, _index_memo in records: data = next(raw_records) yield key, data def _record_to_data(self, key, digest, lines, dense_lines=None): r"""Convert key, digest, lines into a raw data block. :param key: The key of the record.
Currently keys are always serialised using just the trailing component. :param dense_lines: The bytes of lines but in a denser form. For instance, if lines is a list of 1000 bytestrings each ending in \n, dense_lines may be a list with one line in it, containing all the 1000's lines and their \n's. Using dense_lines if it is already known is a win because the string join to create bytes in this function spends less time resizing the final string. :return: (len, chunked bytestring with compressed data) """ chunks = [b"version %s %d %s\n" % (key[-1], len(lines), digest)] chunks.extend(dense_lines or lines) chunks.append(b"end " + key[-1] + b"\n") for chunk in chunks: if not isinstance(chunk, bytes): raise AssertionError(f"data must be plain bytes was {type(chunk)}") if lines and not lines[-1].endswith(b"\n"): raise ValueError(f"corrupt lines value {lines!r}") compressed_chunks = tuned_gzip.chunks_to_gzip(chunks) return sum(map(len, compressed_chunks)), compressed_chunks def _split_header(self, line): rec = line.split() if len(rec) != 4: raise KnitCorrupt(self, "unexpected number of elements in record header") return rec def keys(self): """See VersionedFiles.keys.""" evil_logger.debug("keys scales with size of history") sources = [self._index] + self._immediate_fallback_vfs result = set() for source in sources: result.update(source.keys()) return result class _ContentMapGenerator: """Generate texts or expose raw deltas for a set of texts.""" def __init__(self, ordering="unordered"): self._ordering = ordering def _get_content(self, key): """Get the content object for key.""" # Note that _get_content is only called when the _ContentMapGenerator # has been constructed with just one key requested for reconstruction. 
if key in self.nonlocal_keys: record = next(self.get_record_stream()) # Create a content object on the fly lines = record.get_bytes_as("lines") return PlainKnitContent(lines, record.key) else: # local keys we can ask for directly return self._get_one_work(key) def get_record_stream(self): """Get a record stream for the keys requested during __init__.""" yield from self._work() def _work(self): """Produce maps of text and KnitContents as dicts. :return: (text_map, content_map) where text_map contains the texts for the requested versions and content_map contains the KnitContents. """ # NB: By definition we never need to read remote sources unless texts # are requested from them: we don't delta across stores - and we # explicitly do not want to, to prevent data loss situations. if self.global_map is None: self.global_map = self.vf.get_parent_map(self.keys) nonlocal_keys = self.nonlocal_keys missing_keys = set(nonlocal_keys) # Read from remote versioned file instances and provide to our caller. for source in self.vf._immediate_fallback_vfs: if not missing_keys: break # Loop over fallback repositories asking them for texts - ignore # any missing from a particular fallback. for record in source.get_record_stream(missing_keys, self._ordering, True): if record.storage_kind == "absent": # Not in this particular stream, may be in one of the # other fallback vfs objects. continue missing_keys.remove(record.key) yield record if self._raw_record_map is None: raise AssertionError("_raw_record_map should have been filled") first = True for key in self.keys: if key in self.nonlocal_keys: continue yield LazyKnitContentFactory(key, self.global_map[key], self, first) first = False def _get_one_work(self, requested_key): # Now, if we have calculated everything already, just return the # desired text. if requested_key in self._contents_map: return self._contents_map[requested_key] # To simplify things, parse everything at once - code that wants one text # probably wants them all.
# FUTURE: This function could be improved for the 'extract many' case # by tracking each component and only doing the copy when the number of # children that need to apply deltas to it is > 1 or it is part of the # final output. multiple_versions = len(self.keys) != 1 if self._record_map is None: self._record_map = self.vf._raw_map_to_record_map(self._raw_record_map) record_map = self._record_map # raw_record_map is key: # Have read and parsed records at this point. for key in self.keys: if key in self.nonlocal_keys: # already handled continue components = [] cursor = key while cursor is not None: try: record, record_details, digest, next = record_map[cursor] except KeyError as e: raise RevisionNotPresent(cursor, self) from e components.append((cursor, record, record_details, digest)) cursor = next if cursor in self._contents_map: # no need to plan further back components.append((cursor, None, None, None)) break content = None for component_id, record, record_details, digest in reversed(components): # noqa: B007 if component_id in self._contents_map: content = self._contents_map[component_id] else: content, _delta = self._factory.parse_record( key[-1], record, record_details, content, copy_base_content=multiple_versions, ) if multiple_versions: self._contents_map[component_id] = content # digest here is the digest from the last applied component. text = content.text() actual_sha = sha_strings(text) if actual_sha != digest: raise SHA1KnitCorrupt(self, actual_sha, digest, key, text) if multiple_versions: return self._contents_map[requested_key] else: return content def _wire_bytes(self): """Get the bytes to put on the wire for 'key'. The first collection of bytes asked for returns the serialised raw_record_map and the additional details (key, parent) for key. Subsequent calls return just the additional details (key, parent). The wire storage_kind given for the first key is 'knit-delta-closure'; for subsequent keys it is 'knit-delta-closure-ref'.
:param key: A key from the content generator. :return: Bytes to put on the wire. """ lines = [] # kind marker for dispatch on the far side, lines.append(b"knit-delta-closure") # Annotated or not if self.vf._factory.annotated: lines.append(b"annotated") else: lines.append(b"") # then the list of keys lines.append( b"\t".join( b"\x00".join(key) for key in self.keys if key not in self.nonlocal_keys ) ) # then the _raw_record_map in serialised form: map_byte_list = [] # for each item in the map: # 1 line with key # 1 line with parents if the key is to be yielded (None: for None, '' for ()) # one line with method # one line with noeol # one line with next ('' for None) # one line with byte count of the record bytes # the record bytes for key, (record_bytes, (method, noeol), next) in self._raw_record_map.items(): key_bytes = b"\x00".join(key) parents = self.global_map.get(key, None) if parents is None: parent_bytes = b"None:" else: parent_bytes = b"\t".join(b"\x00".join(key) for key in parents) method_bytes = method.encode("ascii") noeol_bytes = b"T" if noeol else b"F" next_bytes = b"\x00".join(next) if next else b"" map_byte_list.append( b"\n".join( [ key_bytes, parent_bytes, method_bytes, noeol_bytes, next_bytes, b"%d" % len(record_bytes), record_bytes, ] ) ) map_bytes = b"".join(map_byte_list) lines.append(map_bytes) bytes = b"\n".join(lines) return bytes class _VFContentMapGenerator(_ContentMapGenerator): """Content map generator reading from a VersionedFiles object.""" def __init__( self, versioned_files, keys, nonlocal_keys=None, global_map=None, raw_record_map=None, ordering="unordered", ): """Create a _ContentMapGenerator. :param versioned_files: The versioned files that the texts are being extracted from. :param keys: The keys to produce content maps for. :param nonlocal_keys: An iterable of keys(possibly intersecting keys) which are known to not be in this knit, but rather in one of the fallback knits. 
:param global_map: The result of get_parent_map(keys) (or a supermap). This is required if get_record_stream() is to be used. :param raw_record_map: A unparsed raw record map to use for answering contents. """ _ContentMapGenerator.__init__(self, ordering=ordering) # The vf to source data from self.vf = versioned_files # The keys desired self.keys = list(keys) # Keys known to be in fallback vfs objects if nonlocal_keys is None: self.nonlocal_keys = set() else: self.nonlocal_keys = frozenset(nonlocal_keys) # Parents data for keys to be returned in get_record_stream self.global_map = global_map # The chunked lists for self.keys in text form self._text_map = {} # A cache of KnitContent objects used in extracting texts. self._contents_map = {} # All the knit records needed to assemble the requested keys as full # texts. self._record_map = None if raw_record_map is None: self._raw_record_map = self.vf._get_record_map_unparsed( keys, allow_missing=True ) else: self._raw_record_map = raw_record_map # the factory for parsing records self._factory = self.vf._factory class _NetworkContentMapGenerator(_ContentMapGenerator): """Content map generator sourced from a network stream.""" def __init__(self, bytes, line_end): """Construct a _NetworkContentMapGenerator from a bytes block.""" self._bytes = bytes self.global_map = {} self._raw_record_map = {} self._contents_map = {} self._record_map = None self.nonlocal_keys = [] # Get access to record parsing facilities self.vf = KnitVersionedFiles(None, None) start = line_end # Annotated or not line_end = bytes.find(b"\n", start) line = bytes[start:line_end] start = line_end + 1 if line == b"annotated": self._factory = KnitAnnotateFactory() else: self._factory = KnitPlainFactory() # list of keys to emit in get_record_stream line_end = bytes.find(b"\n", start) line = bytes[start:line_end] start = line_end + 1 self.keys = [ tuple(segment.split(b"\x00")) for segment in line.split(b"\t") if segment ] # now a loop until the end. 
XXX: It would be nice if this was just a # bunch of the same records as get_record_stream(..., False) gives, but # there is a decent sized gap stopping that at the moment. end = len(bytes) while start < end: # 1 line with key line_end = bytes.find(b"\n", start) key = tuple(bytes[start:line_end].split(b"\x00")) start = line_end + 1 # 1 line with parents (None: for None, '' for ()) line_end = bytes.find(b"\n", start) line = bytes[start:line_end] if line == b"None:": parents = None else: parents = tuple( tuple(segment.split(b"\x00")) for segment in line.split(b"\t") if segment ) self.global_map[key] = parents start = line_end + 1 # one line with method line_end = bytes.find(b"\n", start) line = bytes[start:line_end] method = line.decode("ascii") start = line_end + 1 # one line with noeol line_end = bytes.find(b"\n", start) line = bytes[start:line_end] noeol = line == b"T" start = line_end + 1 # one line with next (b'' for None) line_end = bytes.find(b"\n", start) line = bytes[start:line_end] next = None if not line else tuple(bytes[start:line_end].split(b"\x00")) start = line_end + 1 # one line with byte count of the record bytes line_end = bytes.find(b"\n", start) line = bytes[start:line_end] count = int(line) start = line_end + 1 # the record bytes record_bytes = bytes[start : start + count] start = start + count # put it in the map self._raw_record_map[key] = (record_bytes, (method, noeol), next) def get_record_stream(self): """Get a record stream for the keys requested by the bytestream.""" first = True for key in self.keys: yield LazyKnitContentFactory(key, self.global_map[key], self, first) first = False def _wire_bytes(self): return self._bytes class _KndxIndex: r"""Manages knit index files. The index is kept in memory and read on startup, to enable fast lookups of revision information. The cursor of the index file is always pointing to the end, making it easy to append entries. _cache is a cache for fast mapping from version id to an Index object.
_history is a cache for fast mapping from indexes to version ids. The index data format is dictionary compressed when it comes to parent references; an index entry may only have parents with a lower index number. As a result, the index is topologically sorted. Duplicate entries may be written to the index for a single version id; if this is done then the latter one completely replaces the former: this allows updates to correct version and parent information. Note that the two entries may share the delta, and that successive annotations and references MUST point to the first entry. The index file on disc contains a header, followed by one line per knit record. The same revision can be present in an index file more than once. The first occurrence gets assigned a sequence number starting from 0. The format of a single line is REVISION_ID FLAGS BYTE_OFFSET LENGTH( PARENT_ID|PARENT_SEQUENCE_ID)* :\n REVISION_ID is a utf8-encoded revision id FLAGS is a comma separated list of flags about the record. Values include no-eol, line-delta, fulltext. BYTE_OFFSET is the ascii representation of the byte offset in the data file that the compressed data starts at. LENGTH is the ascii representation of the length of the compressed record in the data file. PARENT_ID a utf-8 revision id prefixed by a '.' that is a parent of REVISION_ID. PARENT_SEQUENCE_ID the ascii representation of the sequence number of a revision id already in the knit that is a parent of REVISION_ID. The ' :' marker is the end of record marker. partial writes: when a write to the index file is interrupted, it will result in a line that does not end in ' :'. If the ' :' is not present at the end of a line, or at the end of the file, then the record that is missing it will be ignored by the parser. When writing new records to the index file, the data is preceded by '\n' to ensure that records always start on new lines even if the last write was interrupted.
As a result it's normal for the last line in the index to be missing a trailing newline. One can be added with no harmful effects. :ivar _kndx_cache: dict from prefix to the old state of KnitIndex objects, where prefix is e.g. the (fileid,) for .texts instances or () for constant-mapped things like .revisions, and the old state is tuple(history_vector, cache_dict). This is used to prevent having an ABI change with the C extension that reads .kndx files. """ HEADER = b"# bzr knit index 8\n" def __init__(self, transport, mapper, get_scope, allow_writes, is_locked): """Create a _KndxIndex on transport using mapper.""" self._transport = transport self._mapper = mapper self._get_scope = get_scope self._allow_writes = allow_writes self._is_locked = is_locked self._reset_cache() self.has_graph = True def add_records(self, records, random_id=False, missing_compression_parents=False): """Add multiple records to the index. :param records: a list of tuples: (key, options, access_memo, parents). :param random_id: If True the ids being added were randomly generated and no check for existence will be performed. :param missing_compression_parents: If True the records being added are only compressed against texts already in the index (or inside records). If False the records all refer to unavailable texts (or texts inside records) as compression parents. """ if missing_compression_parents: # It might be nice to get the edge of the records. But keys isn't # _wrong_.
keys = sorted(record[0] for record in records) raise RevisionNotPresent(keys, self) paths = {} for record in records: key = record[0] prefix = key[:-1] path = self._mapper.map(key) + ".kndx" path_keys = paths.setdefault(path, (prefix, [])) path_keys[1].append(record) for path in sorted(paths): prefix, path_keys = paths[path] self._load_prefixes([prefix]) lines = [] orig_history = self._kndx_cache[prefix][1][:] orig_cache = self._kndx_cache[prefix][0].copy() try: for key, options, (_, pos, size), parents in path_keys: if not all(isinstance(option, bytes) for option in options): raise TypeError(options) if parents is None: # kndx indices cannot be parentless. parents = () line = b" ".join( [ b"\n" + key[-1], b",".join(options), b"%d" % pos, b"%d" % size, self._dictionary_compress(parents), b":", ] ) if not isinstance(line, bytes): raise AssertionError(f"data must be utf8 was {type(line)}") lines.append(line) self._cache_key(key, options, pos, size, parents) if len(orig_history): self._transport.append_bytes(path, b"".join(lines)) else: self._init_index(path, lines) except: # If any problems happen, restore the original values and re-raise self._kndx_cache[prefix] = (orig_cache, orig_history) raise def scan_unvalidated_index(self, graph_index): """See _KnitGraphIndex.scan_unvalidated_index.""" # Because kndx files do not support atomic insertion via separate index # files, they do not support this method. raise NotImplementedError(self.scan_unvalidated_index) def get_missing_compression_parents(self): """See _KnitGraphIndex.get_missing_compression_parents.""" # Because kndx files do not support atomic insertion via separate index # files, they do not support this method. raise NotImplementedError(self.get_missing_compression_parents) def _cache_key(self, key, options, pos, size, parent_keys): """Cache a version record in the history array and index cache. This is inlined into _load_data for performance. KEEP IN SYNC. 
(It saves 60ms, 25% of the __init__ overhead on local 4000 record indexes). """ prefix = key[:-1] version_id = key[-1] # last-element only for compatibility with the C load_data. parents = tuple(parent[-1] for parent in parent_keys) for parent in parent_keys: if parent[:-1] != prefix: raise ValueError(f"mismatched prefixes for {key!r}, {parent_keys!r}") cache, history = self._kndx_cache[prefix] # only want the _history index to reference the 1st index entry # for version_id if version_id not in cache: index = len(history) history.append(version_id) else: index = cache[version_id][5] cache[version_id] = (version_id, options, pos, size, parents, index) def check_header(self, fp): line = fp.readline() if line == b"": # An empty file can actually be treated as though the file doesn't # exist yet. raise NoSuchFile(self) if line != self.HEADER: raise KnitHeaderError(badline=line, filename=self) def _check_read(self): if not self._is_locked(): raise ObjectNotLocked(self) if self._get_scope() != self._scope: self._reset_cache() def _check_write_ok(self): """Raise an error if writes are not permitted.""" if not self._is_locked(): raise ObjectNotLocked(self) if self._get_scope() != self._scope: self._reset_cache() if self._mode != "w": raise ReadOnlyObjectDirtiedError(self) def get_build_details(self, keys): """Get the method, index_memo and compression parent for keys. Ghosts are omitted from the result. :param keys: An iterable of keys. :return: A dict of key:(index_memo, compression_parent, parents, record_details).
index_memo opaque structure to pass to read_records to extract the raw data compression_parent Content that this record is built upon, may be None parents Logical parents of this node record_details extra information about the content which needs to be passed to Factory.parse_record """ parent_map = self.get_parent_map(keys) result = {} for key in keys: if key not in parent_map: continue # Ghost method = self.get_method(key) if not isinstance(method, str): raise TypeError(method) parents = parent_map[key] compression_parent = None if method == "fulltext" else parents[0] noeol = b"no-eol" in self.get_options(key) index_memo = self.get_position(key) result[key] = (index_memo, compression_parent, parents, (method, noeol)) return result def get_method(self, key): """Return compression method of specified key.""" options = self.get_options(key) if b"fulltext" in options: return "fulltext" elif b"line-delta" in options: return "line-delta" else: raise KnitIndexUnknownMethod(self, options) def get_options(self, key): """Return a list representing options. e.g. ['foo', 'bar'] """ prefix, suffix = self._split_key(key) self._load_prefixes([prefix]) try: return self._kndx_cache[prefix][0][suffix][1] except KeyError as e: raise RevisionNotPresent(key, self) from e def find_ancestry(self, keys): """See CombinedGraphIndex.find_ancestry().""" prefixes = {key[:-1] for key in keys} self._load_prefixes(prefixes) parent_map = {} missing_keys = set() pending_keys = list(keys) # This assumes that keys will not reference parents in a different # prefix, which is accurate so far. 
while pending_keys: key = pending_keys.pop() if key in parent_map: continue prefix = key[:-1] try: suffix_parents = self._kndx_cache[prefix][0][key[-1]][4] except KeyError: missing_keys.add(key) else: parent_keys = tuple([prefix + (suffix,) for suffix in suffix_parents]) parent_map[key] = parent_keys pending_keys.extend([p for p in parent_keys if p not in parent_map]) return parent_map, missing_keys def get_parent_map(self, keys): """Get a map of the parents of keys. :param keys: The keys to look up parents for. :return: A mapping from keys to parents. Absent keys are absent from the mapping. """ # Parse what we need to up front, this potentially trades off I/O # locality (.kndx and .knit in the same block group for the same file # id) for less checking in inner loops. prefixes = {key[:-1] for key in keys} self._load_prefixes(prefixes) result = {} for key in keys: prefix = key[:-1] try: suffix_parents = self._kndx_cache[prefix][0][key[-1]][4] except KeyError: pass else: result[key] = tuple(prefix + (suffix,) for suffix in suffix_parents) return result def get_position(self, key): """Return details needed to access the version. :return: a tuple (key, data position, size) to hand to the access logic to get the record. """ prefix, suffix = self._split_key(key) self._load_prefixes([prefix]) entry = self._kndx_cache[prefix][0][suffix] return key, entry[2], entry[3] __contains__ = _mod_index._has_key_from_parent_map def _init_index(self, path, extra_lines=None): """Initialize an index.""" if extra_lines is None: extra_lines = [] sio = BytesIO() sio.write(self.HEADER) sio.writelines(extra_lines) sio.seek(0) self._transport.put_file_non_atomic(path, sio, create_parent_dir=True) # self._create_parent_dir) # mode=self._file_mode, # dir_mode=self._dir_mode) def keys(self): """Get all the keys in the collection. The keys are not ordered. """ result = set() # Identify all key prefixes. # XXX: A bit hacky, needs polish. 
if isinstance(self._mapper, ConstantMapper): prefixes = [()] else: relpaths = set() for quoted_relpath in self._transport.iter_files_recursive(): path, _ext = os.path.splitext(quoted_relpath) relpaths.add(path) prefixes = [self._mapper.unmap(path) for path in relpaths] self._load_prefixes(prefixes) for prefix in prefixes: for suffix in self._kndx_cache[prefix][1]: result.add(prefix + (suffix,)) return result def _load_prefixes(self, prefixes): """Load the indices for prefixes.""" self._check_read() for prefix in prefixes: if prefix not in self._kndx_cache: # the load_data interface writes to these variables. self._cache = {} self._history = [] self._filename = prefix try: path = self._mapper.map(prefix) + ".kndx" with self._transport.get(path) as fp: # _load_data may raise NoSuchFile if the target knit is # completely empty. _load_data(self, fp) self._kndx_cache[prefix] = (self._cache, self._history) del self._cache del self._filename del self._history except TransportNoSuchFile: self._kndx_cache[prefix] = ({}, []) if isinstance(self._mapper, ConstantMapper): # preserve behaviour for revisions.kndx etc. self._init_index(path) del self._cache del self._filename del self._history missing_keys = _mod_index._missing_keys_from_parent_map def _partition_keys(self, keys): """Turn keys into a dict of prefix:suffix_list.""" result = {} for key in keys: prefix_keys = result.setdefault(key[:-1], []) prefix_keys.append(key[-1]) return result def _dictionary_compress(self, keys): """Dictionary compress keys. :param keys: The keys to generate references to. :return: A string representation of keys. keys which are present are dictionary compressed, and others are emitted as fulltext with a '.' prefix. """ if not keys: return b"" result_list = [] prefix = keys[0][:-1] cache = self._kndx_cache[prefix][0] for key in keys: if key[:-1] != prefix: # kndx indices cannot refer across partitioned storage. 
raise ValueError(f"mismatched prefixes for {keys!r}") if key[-1] in cache: # -- inlined lookup() -- result_list.append(b"%d" % cache[key[-1]][5]) # -- end lookup () -- else: result_list.append(b"." + key[-1]) return b" ".join(result_list) def _reset_cache(self): # Possibly this should be a LRU cache. A dictionary from key_prefix to # (cache_dict, history_vector) for parsed kndx files. self._kndx_cache = {} self._scope = self._get_scope() allow_writes = self._allow_writes() if allow_writes: self._mode = "w" else: self._mode = "r" def _sort_keys_by_io(self, keys, positions): """Figure out an optimal order to read the records for the given keys. Sort keys, grouped by index and sorted by position. :param keys: A list of keys whose records we want to read. This will be sorted 'in-place'. :param positions: A dict, such as the one returned by _get_components_positions() :return: None """ def get_sort_key(key): index_memo = positions[key][1] # Group by prefix and position. index_memo[0] is the key, so it is # (file_id, revision_id) and we don't want to sort on revision_id, # index_memo[1] is the position, and index_memo[2] is the size, # which doesn't matter for the sort return index_memo[0][:-1], index_memo[1] return keys.sort(key=get_sort_key) _get_total_build_size = _get_total_build_size def _split_key(self, key): """Split key into a prefix and suffix.""" # GZ 2018-07-03: This is intentionally either a sequence or bytes? if isinstance(key, bytes): return key[:-1], key[-1:] return key[:-1], key[-1] def as_tuples(obj): """Ensure that the object and any referenced objects are plain tuples. :param obj: a list or a tuple :return: a plain tuple instance, with all children also being tuples. 
""" result = [] for item in obj: if isinstance(item, (tuple, list)): item = as_tuples(item) result.append(item) return tuple(result) class _KnitGraphIndex: """A KnitVersionedFiles index layered on GraphIndex.""" def __init__( self, graph_index, is_locked, deltas=False, parents=True, add_callback=None, track_external_parent_refs=False, ): """Construct a KnitGraphIndex on a graph_index. :param graph_index: An implementation of bzrformts.index.GraphIndex. :param is_locked: A callback to check whether the object should answer queries. :param deltas: Allow delta-compressed records. :param parents: If True, record knits parents, if not do not record parents. :param add_callback: If not None, allow additions to the index and call this callback with a list of added GraphIndex nodes: [(node, value, node_refs), ...] :param is_locked: A callback, returns True if the index is locked and thus usable. :param track_external_parent_refs: If True, record all external parent references parents from added records. These can be retrieved later by calling get_missing_parents(). """ self._add_callback = add_callback self._graph_index = graph_index self._deltas = deltas self._parents = parents if deltas and not parents: # XXX: TODO: Delta tree and parent graph should be conceptually # separate. raise KnitCorrupt( self, "Cannot do delta compression without parent tracking." ) self.has_graph = parents self._is_locked = is_locked self._missing_compression_parents = set() if track_external_parent_refs: self._key_dependencies = _KeyRefs() else: self._key_dependencies = None def __repr__(self): return f"{self.__class__.__name__}({self._graph_index!r})" def add_records(self, records, random_id=False, missing_compression_parents=False): """Add multiple records to the index. 
This function does not insert data into the immutable GraphIndex backing the KnitGraphIndex; instead it prepares data for insertion by the caller, checks that it is safe to insert, and then calls self._add_callback with the prepared GraphIndex nodes. :param records: a list of tuples: (key, options, access_memo, parents). :param random_id: If True the ids being added were randomly generated and no check for existence will be performed. :param missing_compression_parents: If True the records being added may refer to compression parents that are not yet available (neither in the index nor inside records); such parents are tracked and reported by get_missing_compression_parents(). If False the records are only compressed against texts already in the index (or inside records). """ if not self._add_callback: raise ReadOnlyError(self) # we hope there are no repositories with inconsistent parentage # anymore. keys = {} compression_parents = set() key_dependencies = self._key_dependencies for key, options, access_memo, parents in records: if self._parents: parents = tuple(parents) if key_dependencies is not None: key_dependencies.add_references(key, parents) _index, pos, size = access_memo value = b"N" if b"no-eol" in options else b" " value += b"%d %d" % (pos, size) if not self._deltas and b"line-delta" in options: raise KnitCorrupt(self, "attempt to add line-delta in non-delta knit") if self._parents: if self._deltas: if b"line-delta" in options: node_refs = (parents, (parents[0],)) if missing_compression_parents: compression_parents.add(parents[0]) else: node_refs = (parents, ()) else: node_refs = (parents,) else: if parents: raise KnitCorrupt( self, "attempt to add node with parents in parentless index." 
) node_refs = () keys[key] = (value, node_refs) # check for dups if not random_id: present_nodes = self._get_entries(keys) for _index, key, value, node_refs in present_nodes: parents = node_refs[:1] # Sometimes these are passed as a list rather than a tuple passed = as_tuples(keys[key]) passed_parents = passed[1][:1] if value[0:1] != keys[key][0][0:1] or parents != passed_parents: node_refs = as_tuples(node_refs) raise KnitCorrupt( self, "inconsistent details in add_records" f": {(value, node_refs)} {passed}", ) del keys[key] result = [] if self._parents: for key, (value, node_refs) in keys.items(): result.append((key, value, node_refs)) else: for key, (value, node_refs) in keys.items(): # noqa: B007 result.append((key, value)) self._add_callback(result) if missing_compression_parents: # This may appear to be incorrect (it does not check for # compression parents that are in the existing graph index), # but such records won't have been buffered, so this is # actually correct: every entry when # missing_compression_parents==True either has a missing parent, or # a parent that is one of the keys in records. compression_parents.difference_update(keys) self._missing_compression_parents.update(compression_parents) # Adding records may have satisfied missing compression parents. self._missing_compression_parents.difference_update(keys) def scan_unvalidated_index(self, graph_index): """Inform this _KnitGraphIndex that there is an unvalidated index. This allows this _KnitGraphIndex to keep track of any missing compression parents we may want to have filled in to make those indices valid. :param graph_index: A GraphIndex """ if self._deltas: new_missing = graph_index.external_references(ref_list_num=1) new_missing.difference_update(self.get_parent_map(new_missing)) self._missing_compression_parents.update(new_missing) if self._key_dependencies is not None: # Add parent refs from graph_index (and discard parent refs that # the graph_index has). 
for node in graph_index.iter_all_entries(): self._key_dependencies.add_references(node[1], node[3][0]) def get_missing_compression_parents(self): """Return the keys of missing compression parents. Missing compression parents occur when a record stream was missing basis texts, or an index was scanned that had missing basis texts. """ return frozenset(self._missing_compression_parents) def get_missing_parents(self): """Return the keys of missing parents.""" # If updating this, you should also update # groupcompress._GCGraphIndex.get_missing_parents # We may have false positives, so filter those out. self._key_dependencies.satisfy_refs_for_keys( self.get_parent_map(self._key_dependencies.get_unsatisfied_refs()) ) return frozenset(self._key_dependencies.get_unsatisfied_refs()) def _check_read(self): """Raise if reads are not permitted.""" if not self._is_locked(): raise ObjectNotLocked(self) def _check_write_ok(self): """Raise if writes are not permitted.""" if not self._is_locked(): raise ObjectNotLocked(self) def _compression_parent(self, an_entry): # return the key that an_entry is compressed against, or None # Grab the second parent list (as deltas implies parents currently) compression_parents = an_entry[3][1] if not compression_parents: return None if len(compression_parents) != 1: raise AssertionError( f"Too many compression parents: {compression_parents!r}" ) return compression_parents[0] def get_build_details(self, keys): """Get the method, index_memo and compression parent for version_ids. Ghosts are omitted from the result. :param keys: An iterable of keys. :return: A dict of key: (index_memo, compression_parent, parents, record_details). 
index_memo opaque structure to pass to read_records to extract the raw data compression_parent Content that this record is built upon, may be None parents Logical parents of this node record_details extra information about the content which needs to be passed to Factory.parse_record """ self._check_read() result = {} entries = self._get_entries(keys, False) for entry in entries: key = entry[1] parents = () if not self._parents else entry[3][0] if not self._deltas: compression_parent_key = None else: compression_parent_key = self._compression_parent(entry) noeol = entry[2][0:1] == b"N" method = "line-delta" if compression_parent_key else "fulltext" result[key] = ( self._node_to_position(entry), compression_parent_key, parents, (method, noeol), ) return result def _get_entries(self, keys, check_present=False): """Get the entries for keys. :param keys: An iterable of index key tuples. """ keys = set(keys) found_keys = set() if self._parents: for node in self._graph_index.iter_entries(keys): yield node found_keys.add(node[1]) else: # adapt parentless index to the rest of the code. for node in self._graph_index.iter_entries(keys): yield node[0], node[1], node[2], () found_keys.add(node[1]) if check_present: missing_keys = keys.difference(found_keys) if missing_keys: raise RevisionNotPresent(missing_keys.pop(), self) def get_method(self, key): """Return compression method of specified key.""" return self._get_method(self._get_node(key)) def _get_method(self, node): if not self._deltas: return "fulltext" if self._compression_parent(node): return "line-delta" else: return "fulltext" def _get_node(self, key): try: return list(self._get_entries([key]))[0] except IndexError as e: raise RevisionNotPresent(key, self) from e def get_options(self, key): """Return a list representing options. e.g. 
['foo', 'bar'] """ node = self._get_node(key) options = [self._get_method(node).encode("ascii")] if node[2][0:1] == b"N": options.append(b"no-eol") return options def find_ancestry(self, keys): """See CombinedGraphIndex.find_ancestry().""" return self._graph_index.find_ancestry(keys, 0) def get_parent_map(self, keys): """Get a map of the parents of keys. :param keys: The keys to look up parents for. :return: A mapping from keys to parents. Absent keys are absent from the mapping. """ self._check_read() nodes = self._get_entries(keys) result = {} if self._parents: for node in nodes: result[node[1]] = node[3][0] else: for node in nodes: result[node[1]] = None return result def get_position(self, key): """Return details needed to access the version. :return: a tuple (index, data position, size) to hand to the access logic to get the record. """ node = self._get_node(key) return self._node_to_position(node) __contains__ = _mod_index._has_key_from_parent_map def keys(self): """Get all the keys in the collection. The keys are not ordered. """ self._check_read() return [node[1] for node in self._graph_index.iter_all_entries()] missing_keys = _mod_index._missing_keys_from_parent_map def _node_to_position(self, node): """Convert an index value to position details.""" bits = node[2][1:].split(b" ") return node[0], int(bits[0]), int(bits[1]) def _sort_keys_by_io(self, keys, positions): """Figure out an optimal order to read the records for the given keys. Sort keys, grouped by index and sorted by position. :param keys: A list of keys whose records we want to read. This will be sorted 'in-place'. :param positions: A dict, such as the one returned by _get_components_positions() :return: None """ def get_index_memo(key): # index_memo is at offset [1]. It is made up of (GraphIndex, # position, size). GI is an object, which will be unique for each # pack file. This causes us to group by pack file, then sort by # position. 
Size doesn't matter, but it isn't worth breaking up the # tuple. return positions[key][1] return keys.sort(key=get_index_memo) _get_total_build_size = _get_total_build_size class _KnitKeyAccess: """Access to records in .knit files.""" def __init__(self, transport, mapper): """Create a _KnitKeyAccess with transport and mapper. :param transport: The transport the access object is rooted at. :param mapper: The mapper used to map keys to .knit files. """ self._transport = transport self._mapper = mapper def add_raw_record(self, key, size, raw_data): """Add raw knit bytes to a storage area. The data is spooled to the container writer in one bytes-record per raw data item. :param key: The key of the raw data segment :param size: The size of the raw data segment :param raw_data: A chunked bytestring containing the data. :return: opaque index memo to retrieve the record later. For _KnitKeyAccess the memo is (key, pos, length), where the key is the record key. """ path = self._mapper.map(key) try: base = self._transport.append_bytes(path + ".knit", b"".join(raw_data)) except TransportNoSuchFile: self._transport.mkdir(osutils.dirname(path)) base = self._transport.append_bytes(path + ".knit", b"".join(raw_data)) # if base == 0: # chmod. return (key, base, size) def add_raw_records(self, key_sizes, raw_data): """Add raw knit bytes to a storage area. The data is spooled to the container writer in one bytes-record per raw data item. :param key_sizes: An iterable of tuples containing the key and size of each raw data segment. :param raw_data: A chunked bytestring containing the data. :return: A list of memos to retrieve the record later. Each memo is an opaque index memo. For _KnitKeyAccess the memo is (key, pos, length), where the key is the record key. 
""" raw_data = b"".join(raw_data) if not isinstance(raw_data, bytes): raise AssertionError(f"data must be plain bytes was {type(raw_data)}") result = [] offset = 0 # TODO: This can be tuned for writing to sftp and other servers where # append() is relatively expensive by grouping the writes to each key # prefix. for key, size in key_sizes: record_bytes = [raw_data[offset : offset + size]] result.append(self.add_raw_record(key, size, record_bytes)) offset += size return result def flush(self): """Flush pending writes on this access object. For .knit files this is a no-op. """ pass def get_raw_records(self, memos_for_retrieval): """Get the raw bytes for a records. :param memos_for_retrieval: An iterable containing the access memo for retrieving the bytes. :return: An iterator over the bytes of the records. """ # first pass, group into same-index request to minimise readv's issued. request_lists = [] current_prefix = None for key, offset, length in memos_for_retrieval: if current_prefix == key[:-1]: current_list.append((offset, length)) else: if current_prefix is not None: request_lists.append((current_prefix, current_list)) current_prefix = key[:-1] current_list = [(offset, length)] # handle the last entry if current_prefix is not None: request_lists.append((current_prefix, current_list)) for prefix, read_vector in request_lists: path = self._mapper.map(prefix) + ".knit" for _pos, data in self._transport.readv(path, read_vector): yield data def annotate_knit(knit, revision_id): """Annotate a knit with no cached annotations. This implementation is for knits with no cached annotations. It will work for knits with cached annotations, but this is not recommended. 
""" annotator = _KnitAnnotator(knit) return iter(annotator.annotate_flat(revision_id)) class _KnitAnnotator(VersionedFileAnnotator): """Build up the annotations for a text.""" def __init__(self, vf): VersionedFileAnnotator.__init__(self, vf) # TODO: handle Nodes which cannot be extracted # self._ghosts = set() # Map from (key, parent_key) => matching_blocks, should be 'use once' self._matching_blocks = {} # KnitContent objects self._content_objects = {} # The number of children that depend on this fulltext content object self._num_compression_children = {} # Delta records that need their compression parent before they can be # expanded self._pending_deltas = {} # Fulltext records that are waiting for their parents fulltexts before # they can be yielded for annotation self._pending_annotation = {} self._all_build_details = {} def _get_build_graph(self, key): """Get the graphs for building texts and annotations. The data you need for creating a full text may be different than the data you need to annotate that text. (At a minimum, you need both parents to create an annotation, but only need 1 parent to generate the fulltext.) :return: A list of (key, index_memo) records, suitable for passing to read_records_iter to start reading in the raw data from the pack file. """ pending = {key} records = [] ann_keys = set() self._num_needed_children[key] = 1 while pending: # get all pending nodes this_iteration = pending build_details = self._vf._index.get_build_details(this_iteration) self._all_build_details.update(build_details) # new_nodes = self._vf._index._get_entries(this_iteration) pending = set() for key, details in build_details.items(): (index_memo, compression_parent, parent_keys, _record_details) = details self._parent_map[key] = parent_keys self._heads_provider = None records.append((key, index_memo)) # Do we actually need to check _annotated_lines? 
pending.update( [p for p in parent_keys if p not in self._all_build_details] ) if parent_keys: for parent_key in parent_keys: if parent_key in self._num_needed_children: self._num_needed_children[parent_key] += 1 else: self._num_needed_children[parent_key] = 1 if compression_parent: if compression_parent in self._num_compression_children: self._num_compression_children[compression_parent] += 1 else: self._num_compression_children[compression_parent] = 1 missing_versions = this_iteration.difference(build_details) if missing_versions: for key in missing_versions: if key in self._parent_map and key in self._text_cache: # We already have this text ready, we just need to # yield it later so we get it annotated ann_keys.add(key) parent_keys = self._parent_map[key] for parent_key in parent_keys: if parent_key in self._num_needed_children: self._num_needed_children[parent_key] += 1 else: self._num_needed_children[parent_key] = 1 pending.update( [p for p in parent_keys if p not in self._all_build_details] ) else: raise RevisionNotPresent(key, self._vf) # Generally we will want to read the records in reverse order, because # we find the parent nodes after the children records.reverse() return records, ann_keys def _get_needed_texts(self, key, pb=None): # if True or len(self._vf._immediate_fallback_vfs) > 0: if len(self._vf._immediate_fallback_vfs) > 0: # If we have fallbacks, go to the generic path yield from VersionedFileAnnotator._get_needed_texts(self, key, pb=pb) return while True: try: records, ann_keys = self._get_build_graph(key) for idx, (sub_key, text, num_lines) in enumerate( self._extract_texts(records) ): if pb is not None: pb.update("annotating", idx, len(records)) yield sub_key, text, num_lines for sub_key in ann_keys: text = self._text_cache[sub_key] num_lines = len(text) # bad assumption yield sub_key, text, num_lines return except pack_repo.RetryWithNewPacks as e: self._vf._access.reload_or_raise(e) # The cached build_details are no longer valid 
self._all_build_details.clear() def _cache_delta_blocks(self, key, compression_parent, delta, lines): parent_lines = self._text_cache[compression_parent] blocks = list(KnitContent.get_line_delta_blocks(delta, parent_lines, lines)) self._matching_blocks[(key, compression_parent)] = blocks def _expand_record( self, key, parent_keys, compression_parent, record, record_details ): delta = None if compression_parent: if compression_parent not in self._content_objects: # Waiting for the parent self._pending_deltas.setdefault(compression_parent, []).append( (key, parent_keys, record, record_details) ) return None # We have the basis parent, so expand the delta num = self._num_compression_children[compression_parent] num -= 1 if num == 0: base_content = self._content_objects.pop(compression_parent) self._num_compression_children.pop(compression_parent) else: self._num_compression_children[compression_parent] = num base_content = self._content_objects[compression_parent] # It is tempting to want to copy_base_content=False for the last # child object. However, whenever noeol=False, # self._text_cache[parent_key] is content._lines. So mutating it # gives very bad results. # The alternative is to copy the lines into text cache, but then we # are copying anyway, so just do it here. content, delta = self._vf._factory.parse_record( key, record, record_details, base_content, copy_base_content=True ) else: # Fulltext record content, _ = self._vf._factory.parse_record( key, record, record_details, None ) if self._num_compression_children.get(key, 0) > 0: self._content_objects[key] = content lines = content.text() self._text_cache[key] = lines if delta is not None: self._cache_delta_blocks(key, compression_parent, delta, lines) return lines def _get_parent_annotations_and_matches(self, key, text, parent_key): """Get the list of annotations for the parent, and the matching lines. 
:param text: The opaque value given by _get_needed_texts :param parent_key: The key for the parent text :return: (parent_annotations, matching_blocks) parent_annotations is a list as long as the number of lines in parent matching_blocks is a list of (parent_idx, text_idx, len) tuples indicating which lines match between the two texts """ block_key = (key, parent_key) if block_key in self._matching_blocks: blocks = self._matching_blocks.pop(block_key) parent_annotations = self._annotations_cache[parent_key] return parent_annotations, blocks return VersionedFileAnnotator._get_parent_annotations_and_matches( self, key, text, parent_key ) def _process_pending(self, key): """The content for 'key' was just processed. Determine if there is any more pending work to be processed. """ to_return = [] if key in self._pending_deltas: compression_parent = key children = self._pending_deltas.pop(key) for child_key, parent_keys, record, record_details in children: self._expand_record( child_key, parent_keys, compression_parent, record, record_details ) if self._check_ready_for_annotations(child_key, parent_keys): to_return.append(child_key) # Also check any children that are waiting for this parent to be # annotation ready if key in self._pending_annotation: children = self._pending_annotation.pop(key) to_return.extend( [ c for c, p_keys in children if self._check_ready_for_annotations(c, p_keys) ] ) return to_return def _check_ready_for_annotations(self, key, parent_keys): """Return true if this text is ready to be yielded. Otherwise, this will return False, and queue the text into self._pending_annotation """ for parent_key in parent_keys: if parent_key not in self._annotations_cache: # still waiting on at least one parent text, so queue it up # Note that if there are multiple parents, we need to wait # for all of them. 
self._pending_annotation.setdefault(parent_key, []).append( (key, parent_keys) ) return False return True def _extract_texts(self, records): """Extract the various texts needed based on records.""" # We iterate in the order read, rather than a strict order requested # However, process what we can, and put off to the side things that # still need parents, cleaning them up when those parents are # processed. # Basic data flow: # 1) As 'records' are read, see if we can expand these records into # Content objects (and thus lines) # 2) If a given line-delta is waiting on its compression parent, it # gets queued up into self._pending_deltas, otherwise we expand # it, and put it into self._text_cache and self._content_objects # 3) If we expanded the text, we will then check to see if all # parents have also been processed. If so, this text gets yielded, # else this record gets set aside into pending_annotation # 4) Further, if we expanded the text in (2), we will then check to # see if there are any children in self._pending_deltas waiting to # also be processed. If so, we go back to (2) for those # 5) Further again, if we yielded the text, we can then check if that # 'unlocks' any of the texts in pending_annotations, which should # then get yielded as well # Note that both steps 4 and 5 are 'recursive' in that unlocking one # compression child could unlock yet another, and yielding a fulltext # will also 'unlock' the children that are waiting on that annotation. # (Though also, unlocking 1 parent's fulltext, does not unlock a child # if other parents are also waiting.) # We want to yield content before expanding child content objects, so # that we know when we can re-use the content lines, and the annotation # code can know when it can stop caching fulltexts, as well. # Children that are missing their compression parent for key, record, _digest in self._vf._read_records_iter(records): # ghosts? 
details = self._all_build_details[key] (_, compression_parent, parent_keys, record_details) = details lines = self._expand_record( key, parent_keys, compression_parent, record, record_details ) if lines is None: # Pending delta should be queued up continue # At this point, we may be able to yield this content, if all # parents are also finished yield_this_text = self._check_ready_for_annotations(key, parent_keys) if yield_this_text: # All parents present yield key, lines, len(lines) to_process = self._process_pending(key) while to_process: this_process = to_process to_process = [] for key in this_process: lines = self._text_cache[key] yield key, lines, len(lines) to_process.extend(self._process_pending(key)) try: from ._knit_load_data_pyx import _load_data_c as _load_data except ModuleNotFoundError as e: osutils.failed_to_load_extension(e) from ._knit_load_data_py import _load_data_py as _load_data bzrformats_3.4.0.orig/bzrformats/lock.py0000644000000000000000000001001315162115107015343 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """File locking for bzrformats. Uses fcntl.lockf which has per-process semantics: multiple file descriptors within the same process can share a lock on the same file. 
""" import fcntl from .errors import LockContention class ReadLock: """OS-level shared (read) lock on a file. The locked file is accessible via the ``f`` attribute. """ def __init__(self, filename): """Acquire a shared read lock on *filename*.""" self.filename = filename self.f = open(filename, "rb") try: fcntl.lockf(self.f, fcntl.LOCK_SH | fcntl.LOCK_NB) except BlockingIOError: self.f.close() raise LockContention(filename) from None except OSError: self.f.close() raise def unlock(self): """Release the lock and close the file.""" fcntl.lockf(self.f, fcntl.LOCK_UN) self.f.close() def temporary_write_lock(self): """Try to upgrade to a write lock. Returns ``(True, write_lock)`` if the upgrade succeeded, or ``(False, self)`` if it failed (read lock is re-acquired). """ fcntl.lockf(self.f, fcntl.LOCK_UN) self.f.close() try: wl = WriteLock(self.filename) return True, wl except (LockContention, OSError): # Re-acquire read lock self.f = open(self.filename, "rb") fcntl.lockf(self.f, fcntl.LOCK_SH | fcntl.LOCK_NB) return False, self class WriteLock: """OS-level exclusive (write) lock on a file. The locked file is accessible via the ``f`` attribute. Creates the file if it does not exist. """ def __init__(self, filename): """Acquire an exclusive write lock on *filename*.""" self.filename = filename try: self.f = open(filename, "rb+") except FileNotFoundError: self.f = open(filename, "wb+") try: fcntl.lockf(self.f, fcntl.LOCK_EX | fcntl.LOCK_NB) except BlockingIOError: self.f.close() raise LockContention(filename) from None except OSError: self.f.close() raise def unlock(self): """Release the lock and close the file.""" fcntl.lockf(self.f, fcntl.LOCK_UN) self.f.close() def restore_read_lock(self): """Downgrade to a read lock, returning a new :class:`ReadLock`.""" fcntl.lockf(self.f, fcntl.LOCK_UN) self.f.close() return ReadLock(self.filename) class LogicalLockResult: """The result of a lock_read/lock_write call. Can be used as a context manager:: with tree.lock_read(): ... 
""" def __init__(self, unlock, token=None): """Initialize with an unlock callable and optional token.""" self.unlock = unlock self.token = token def __repr__(self): """Return string representation.""" return f"LogicalLockResult({self.unlock})" def __enter__(self): """Enter context manager.""" return self def __exit__(self, exc_type, exc_val, exc_tb): """Exit context manager, releasing the lock.""" try: self.unlock() except BaseException: if exc_type is None: raise return False bzrformats_3.4.0.orig/bzrformats/lru_cache.py0000644000000000000000000003557615162115107016365 0ustar00# Copyright (C) 2006, 2008, 2009 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """A simple least-recently-used (LRU) cache.""" import logging logger = logging.getLogger(__name__) _null_key = object() class _LRUNode: """This maintains the linked-list which is the lru internals.""" __slots__ = ("key", "next_key", "prev", "value") def __init__(self, key, value): self.prev = None self.next_key = _null_key self.key = key self.value = value def __repr__(self): prev_key = None if self.prev is None else self.prev.key return "{}({!r} n:{!r} p:{!r})".format( self.__class__.__name__, self.key, self.next_key, prev_key ) class LRUCache: """A class which manages a cache of entries, removing unused ones.""" def __init__(self, max_cache=100, after_cleanup_count=None): """Initialize an LRUCache. Args: max_cache: The maximum number of entries to cache. after_cleanup_count: After cleanup, we should have at most this many entries. Defaults to 80% of max_cache. """ self._cache = {} # The "HEAD" of the lru linked list self._most_recently_used = None # The "TAIL" of the lru linked list self._least_recently_used = None self._update_max_cache(max_cache, after_cleanup_count) def __contains__(self, key): """Check if a key is in the cache. Args: key: The key to check. Returns: True if the key is cached, False otherwise. """ return key in self._cache def __getitem__(self, key): """Get a value from the cache and mark it as recently used. Args: key: The key to retrieve. Returns: The cached value. Raises: KeyError: If the key is not in the cache. """ cache = self._cache node = cache[key] # Inlined from _record_access to decrease the overhead of __getitem__ # We also have more knowledge about structure if __getitem__ is # succeeding, then we know that self._most_recently_used must not be # None, etc. 
mru = self._most_recently_used if node is mru: # Nothing to do, this node is already at the head of the queue return node.value # Remove this node from the old location node_prev = node.prev next_key = node.next_key # benchmarking shows that the lookup of _null_key in globals is faster # than the attribute lookup for (node is self._least_recently_used) if next_key is _null_key: # 'node' is the _least_recently_used, because it doesn't have a # 'next' item. So move the current lru to the previous node. self._least_recently_used = node_prev else: node_next = cache[next_key] node_next.prev = node_prev node_prev.next_key = next_key # Insert this node at the front of the list node.next_key = mru.key mru.prev = node self._most_recently_used = node node.prev = None return node.value def __len__(self): """Return the number of items in the cache. Returns: The number of cached entries. """ return len(self._cache) def __setitem__(self, key, value): """Add a new value to the cache.""" if key is _null_key: raise ValueError("cannot use _null_key as a key") if key in self._cache: node = self._cache[key] node.value = value self._record_access(node) else: node = _LRUNode(key, value) self._cache[key] = node self._record_access(node) if len(self._cache) > self._max_cache: # Trigger the cleanup self.cleanup() def cache_size(self): """Get the number of entries we will cache.""" return self._max_cache def get(self, key, default=None): """Get a value from the cache, returning default if not found. Args: key: The key to retrieve. default: Value to return if key is not found. Returns: The cached value or default if not found. """ node = self._cache.get(key, None) if node is None: return default self._record_access(node) return node.value def keys(self): """Get the list of keys currently cached. Note that values returned here may not be available by the time you request them later. This is simply meant as a peek into the current state. 
:return: An unordered list of keys that are currently cached. """ # GZ 2016-06-04: Maybe just make this return the view? return list(self._cache.keys()) def as_dict(self): """Get a new dict with the same key:value pairs as the cache.""" return {k: n.value for k, n in self._cache.items()} def cleanup(self): """Clear the cache until it shrinks to the requested size. This does not completely wipe the cache, just makes sure it is under the after_cleanup_count. """ # Make sure the cache is shrunk to the correct size while len(self._cache) > self._after_cleanup_count: self._remove_lru() def _record_access(self, node): """Record that key was accessed.""" # Move 'node' to the front of the queue if self._most_recently_used is None: self._most_recently_used = node self._least_recently_used = node return elif node is self._most_recently_used: # Nothing to do, this node is already at the head of the queue return # We've taken care of the tail pointer, remove the node, and insert it # at the front # REMOVE if node is self._least_recently_used: self._least_recently_used = node.prev if node.prev is not None: node.prev.next_key = node.next_key if node.next_key is not _null_key: node_next = self._cache[node.next_key] node_next.prev = node.prev # INSERT node.next_key = self._most_recently_used.key self._most_recently_used.prev = node self._most_recently_used = node node.prev = None def _remove_node(self, node): if node is self._least_recently_used: self._least_recently_used = node.prev self._cache.pop(node.key) # If we have removed all entries, remove the head pointer as well if self._least_recently_used is None: self._most_recently_used = None if node.prev is not None: node.prev.next_key = node.next_key if node.next_key is not _null_key: node_next = self._cache[node.next_key] node_next.prev = node.prev # And remove this node's pointers node.prev = None node.next_key = _null_key def _remove_lru(self): """Remove one entry from the lru, and handle consequences. 
If there are no more references to the lru, then this entry should be removed from the cache. """ self._remove_node(self._least_recently_used) def clear(self): """Clear out all of the cache.""" # Clean up in LRU order while self._cache: self._remove_lru() def resize(self, max_cache, after_cleanup_count=None): """Change the number of entries that will be cached.""" self._update_max_cache(max_cache, after_cleanup_count=after_cleanup_count) def _update_max_cache(self, max_cache, after_cleanup_count=None): self._max_cache = max_cache if after_cleanup_count is None: self._after_cleanup_count = self._max_cache * 8 // 10 else: self._after_cleanup_count = min(after_cleanup_count, self._max_cache) self.cleanup() class LRUSizeCache(LRUCache): """An LRUCache that removes things based on the size of the values. This differs in that it doesn't care how many actual items there are, it just restricts the cache to be cleaned up after so much data is stored. The size of items added will be computed using compute_size(value), which defaults to len() if not supplied. """ def __init__( self, max_size=1024 * 1024, after_cleanup_size=None, compute_size=None ): """Create a new LRUSizeCache. :param max_size: The max number of bytes to store before we start clearing out entries. :param after_cleanup_size: After cleaning up, shrink everything to this size. :param compute_size: A function to compute the size of the values. We use a function here, so that you can pass 'len' if you are just using simple strings, or a more complex function if you are using something like a list of strings, or even a custom object. The function should take the form "compute_size(value) => integer". 
If not supplied, it defaults to 'len()' """ self._value_size = 0 self._compute_size = compute_size if compute_size is None: self._compute_size = len self._update_max_size(max_size, after_cleanup_size=after_cleanup_size) LRUCache.__init__(self, max_cache=max(int(max_size // 512), 1)) def __setitem__(self, key, value): """Add a new value to the cache.""" if key is _null_key: raise ValueError("cannot use _null_key as a key") node = self._cache.get(key, None) value_len = self._compute_size(value) if value_len >= self._after_cleanup_size: # The new value is 'too big to fit', as it would fill up/overflow # the cache all by itself logger.debug( "Adding the key %r to an LRUSizeCache failed." " value %d is too big to fit in the cache" " with size %d %d", key, value_len, self._after_cleanup_size, self._max_size, ) if node is not None: # We won't be replacing the old node, so just remove it self._remove_node(node) return if node is None: node = _LRUNode(key, value) self._cache[key] = node else: self._value_size -= self._compute_size(node.value) self._value_size += value_len self._record_access(node) if self._value_size > self._max_size: # Time to cleanup self.cleanup() def cleanup(self): """Clear the cache until it shrinks to the requested size. This does not completely wipe the cache, just makes sure it is under the after_cleanup_size. 
""" # Make sure the cache is shrunk to the correct size while self._value_size > self._after_cleanup_size: self._remove_lru() def _remove_node(self, node): self._value_size -= self._compute_size(node.value) LRUCache._remove_node(self, node) def resize(self, max_size, after_cleanup_size=None): """Change the number of bytes that will be cached.""" self._update_max_size(max_size, after_cleanup_size=after_cleanup_size) max_cache = max(int(max_size // 512), 1) self._update_max_cache(max_cache) def _update_max_size(self, max_size, after_cleanup_size=None): self._max_size = max_size if after_cleanup_size is None: self._after_cleanup_size = self._max_size * 8 // 10 else: self._after_cleanup_size = min(after_cleanup_size, self._max_size) from collections import deque class FIFOCache(dict): """A cache that evicts the oldest entries first (first-in, first-out).""" def __init__(self, max_cache=100, after_cleanup_count=None): """Initialize the FIFO cache with a maximum size.""" dict.__init__(self) self._max_cache = max_cache if after_cleanup_count is None: self._after_cleanup_count = self._max_cache * 8 // 10 else: self._after_cleanup_count = min(after_cleanup_count, self._max_cache) self._cleanup = {} self._queue = deque() def __setitem__(self, key, value): """Set an item, evicting old entries if necessary.""" self.add(key, value) def __delitem__(self, key): """Delete an item from the cache.""" self._queue.remove(key) self._remove(key) def add(self, key, value, cleanup=None): """Add a key/value pair, with an optional cleanup callback.""" if key in self: del self[key] self._queue.append(key) dict.__setitem__(self, key, value) if cleanup is not None: self._cleanup[key] = cleanup if len(self) > self._max_cache: self.cleanup() def cache_size(self): """Return the maximum number of entries this cache will hold.""" return self._max_cache def cleanup(self): """Evict oldest entries until the cache is within its target size.""" while len(self) > self._after_cleanup_count: 
self._remove_oldest() def clear(self): """Remove all entries from the cache.""" while self: self._remove_oldest() def _remove(self, key): cleanup = self._cleanup.pop(key, None) val = dict.pop(self, key) if cleanup is not None: cleanup(key, val) return val def _remove_oldest(self): key = self._queue.popleft() self._remove(key) def resize(self, max_cache, after_cleanup_count=None): """Resize the cache to hold at most *max_cache* entries.""" self._max_cache = max_cache if after_cleanup_count is None: self._after_cleanup_count = max_cache * 8 // 10 else: self._after_cleanup_count = min(max_cache, after_cleanup_count) if len(self) > self._max_cache: self.cleanup() def setdefault(self, key, defaultval=None): """Return the value for *key*, setting it to *defaultval* if missing.""" if key in self: return self[key] self[key] = defaultval return defaultval bzrformats_3.4.0.orig/bzrformats/merge.py0000644000000000000000000005007015162115103015515 0ustar00# Copyright (C) 2005-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Plan merge implementation for versioned files.""" import logging import patiencediff from vcsgraph import graph as _mod_graph from . import revision as _mod_revision logger = logging.getLogger(__name__) from vcsgraph.tsort import merge_sort from . 
import weave from .errors import RevisionNotPresent class _PlanMergeBase: def __init__(self, a_rev, b_rev, vf, key_prefix): """Constructor. :param a_rev: Revision-id of one revision to merge :param b_rev: Revision-id of the other revision to merge :param vf: A VersionedFiles containing both revisions :param key_prefix: A prefix for accessing keys in vf, typically (file_id,). """ self.a_rev = a_rev self.b_rev = b_rev self.vf = vf self._last_lines = None self._last_lines_revision_id = None self._cached_matching_blocks = {} self._key_prefix = key_prefix self._precache_tip_lines() def _precache_tip_lines(self): lines = self.get_lines([self.a_rev, self.b_rev]) self.lines_a = lines[self.a_rev] self.lines_b = lines[self.b_rev] def get_lines(self, revisions): """Get lines for revisions from the backing VersionedFiles. :raises RevisionNotPresent: on absent texts. """ keys = [(self._key_prefix + (rev,)) for rev in revisions] result = {} for record in self.vf.get_record_stream(keys, "unordered", True): if record.storage_kind == "absent": raise RevisionNotPresent(record.key, self.vf) result[record.key[-1]] = record.get_bytes_as("lines") return result def plan_merge(self): """Generate a 'plan' for merging the two revisions. This involves comparing their texts and determining the cause of differences. If text A has a line and text B does not, then either the line was added to text A, or it was deleted from B. 
Once the causes are combined, they are written out in the format described in VersionedFile.plan_merge """ blocks = self._get_matching_blocks(self.a_rev, self.b_rev) unique_a, unique_b = self._unique_lines(blocks) new_a, killed_b = self._determine_status(self.a_rev, unique_a) new_b, killed_a = self._determine_status(self.b_rev, unique_b) return self._iter_plan(blocks, new_a, killed_b, new_b, killed_a) def _iter_plan(self, blocks, new_a, killed_b, new_b, killed_a): last_i = 0 last_j = 0 for i, j, n in blocks: for a_index in range(last_i, i): if a_index in new_a: if a_index in killed_b: yield "conflicted-a", self.lines_a[a_index] else: yield "new-a", self.lines_a[a_index] else: yield "killed-b", self.lines_a[a_index] for b_index in range(last_j, j): if b_index in new_b: if b_index in killed_a: yield "conflicted-b", self.lines_b[b_index] else: yield "new-b", self.lines_b[b_index] else: yield "killed-a", self.lines_b[b_index] # handle common lines for a_index in range(i, i + n): yield "unchanged", self.lines_a[a_index] last_i = i + n last_j = j + n def _get_matching_blocks(self, left_revision, right_revision): """Return a description of which sections of two revisions match. See SequenceMatcher.get_matching_blocks """ cached = self._cached_matching_blocks.get((left_revision, right_revision)) if cached is not None: return cached if self._last_lines_revision_id == left_revision: left_lines = self._last_lines right_lines = self.get_lines([right_revision])[right_revision] else: lines = self.get_lines([left_revision, right_revision]) left_lines = lines[left_revision] right_lines = lines[right_revision] self._last_lines = right_lines self._last_lines_revision_id = right_revision matcher = patiencediff.PatienceSequenceMatcher(None, left_lines, right_lines) return matcher.get_matching_blocks() def _unique_lines(self, matching_blocks): """Analyse matching_blocks to determine which lines are unique. 
:return: a tuple of (unique_left, unique_right), where the values are sets of line numbers of unique lines. """ last_i = 0 last_j = 0 unique_left = [] unique_right = [] for i, j, n in matching_blocks: unique_left.extend(range(last_i, i)) unique_right.extend(range(last_j, j)) last_i = i + n last_j = j + n return unique_left, unique_right @staticmethod def _subtract_plans(old_plan, new_plan): """Remove changes from new_plan that came from old_plan. It is assumed that the difference between the old_plan and new_plan is their choice of 'b' text. All lines from new_plan that differ from old_plan are emitted verbatim. All lines from new_plan that match old_plan but are not about the 'b' revision are emitted verbatim. Lines that match and are about the 'b' revision are the lines we don't want, so we convert 'killed-b' -> 'unchanged', and 'new-b' is skipped entirely. """ matcher = patiencediff.PatienceSequenceMatcher(None, old_plan, new_plan) last_j = 0 for _i, j, n in matcher.get_matching_blocks(): for jj in range(last_j, j): yield new_plan[jj] for jj in range(j, j + n): plan_line = new_plan[jj] if plan_line[0] == "new-b": pass elif plan_line[0] == "killed-b": yield "unchanged", plan_line[1] else: yield plan_line last_j = j + n class _PlanMerge(_PlanMergeBase): """Plan an annotate merge using on-the-fly annotation.""" def __init__(self, a_rev, b_rev, vf, key_prefix): super().__init__(a_rev, b_rev, vf, key_prefix) self.a_key = self._key_prefix + (self.a_rev,) self.b_key = self._key_prefix + (self.b_rev,) self.graph = _mod_graph.Graph(self.vf) heads = self.graph.heads((self.a_key, self.b_key)) if len(heads) == 1: # one side dominates, so we can just return its values, yay for # per-file graphs # Ideally we would know that before we get this far self._head_key = heads.pop() other = b_rev if self._head_key == self.a_key else a_rev logger.debug( "found dominating revision for %s\n%s > %s", self.vf, self._head_key[-1], other, ) self._weave = None else: self._head_key = None 
self._build_weave() def _precache_tip_lines(self): # Turn this into a no-op, because we will do this later pass def _find_recursive_lcas(self): """Find all the ancestors back to a unique lca.""" cur_ancestors = (self.a_key, self.b_key) # graph.find_lca(uncommon, keys) now returns plain NULL_REVISION, # rather than a key tuple. We will just map that directly to no common # ancestors. parent_map = {} while True: next_lcas = self.graph.find_lca(*cur_ancestors) # Map a plain NULL_REVISION to a simple no-ancestors if next_lcas == {_mod_revision.NULL_REVISION}: next_lcas = () # Order the lca's based on when they were merged into the tip # While the actual merge portion of weave merge uses a set() of # active revisions, the order of insertion *does* affect the # implicit ordering of the texts. for rev_key in cur_ancestors: ordered_parents = tuple(self.graph.find_merge_order(rev_key, next_lcas)) parent_map[rev_key] = ordered_parents if len(next_lcas) == 0: break elif len(next_lcas) == 1: parent_map[list(next_lcas)[0]] = () break elif len(next_lcas) > 2: # More than 2 lca's, fall back to grabbing all nodes between # this and the unique lca. logger.debug( "More than 2 LCAs, falling back to all nodes for: %s, %s\n=> %s", self.a_key, self.b_key, cur_ancestors, ) cur_lcas = next_lcas while len(cur_lcas) > 1: cur_lcas = self.graph.find_lca(*cur_lcas) if len(cur_lcas) == 0: # No common base to find, use the full ancestry unique_lca = None else: unique_lca = list(cur_lcas)[0] if unique_lca == _mod_revision.NULL_REVISION: # find_lca will return a plain 'NULL_REVISION' rather # than a key tuple when there is no common ancestor, we # prefer to just use None, because it doesn't confuse # _get_interesting_texts() unique_lca = None parent_map.update(self._find_unique_parents(next_lcas, unique_lca)) break cur_ancestors = next_lcas return parent_map def _find_unique_parents(self, tip_keys, base_key): """Find ancestors of tip that aren't ancestors of base. 
:param tip_keys: Nodes that are interesting :param base_key: Cull all ancestors of this node :return: The parent map for all revisions between tip_keys and base_key. base_key will be included. References to nodes outside of the ancestor set will also be removed. """ # TODO: this would be simpler if find_unique_ancestors took a list # instead of a single tip, internally it supports it, but it # isn't a "backwards compatible" api change. if base_key is None: parent_map = dict(self.graph.iter_ancestry(tip_keys)) # We remove NULL_REVISION because it isn't a proper tuple key, and # thus confuses things like _get_interesting_texts, and our logic # to add the texts into the memory weave. if _mod_revision.NULL_REVISION in parent_map: parent_map.pop(_mod_revision.NULL_REVISION) else: interesting = set() for tip in tip_keys: interesting.update(self.graph.find_unique_ancestors(tip, [base_key])) parent_map = self.graph.get_parent_map(interesting) parent_map[base_key] = () culled_parent_map, child_map, tails = self._remove_external_references( parent_map ) # Remove all the tails but base_key if base_key is not None: tails.remove(base_key) self._prune_tails(culled_parent_map, child_map, tails) # Now remove all the uninteresting 'linear' regions simple_map = _mod_graph.collapse_linear_regions(culled_parent_map) return simple_map @staticmethod def _remove_external_references(parent_map): """Remove references that go outside of the parent map. :param parent_map: Something returned from Graph.get_parent_map(keys) :return: (filtered_parent_map, child_map, tails) filtered_parent_map is parent_map without external references child_map is the {parent_key: [child_keys]} mapping tails is a list of nodes that do not have any parents in the map """ # TODO: The basic effect of this function seems more generic than # _PlanMerge. But the specific details of building a child_map, # and computing tails seems very specific to _PlanMerge. # Still, should this be in Graph land? 
filtered_parent_map = {} child_map = {} tails = [] for key, parent_keys in parent_map.items(): culled_parent_keys = [p for p in parent_keys if p in parent_map] if not culled_parent_keys: tails.append(key) for parent_key in culled_parent_keys: child_map.setdefault(parent_key, []).append(key) # TODO: Do we want to do this, it adds overhead for every node, # just to say that the node has no children child_map.setdefault(key, []) filtered_parent_map[key] = culled_parent_keys return filtered_parent_map, child_map, tails @staticmethod def _prune_tails(parent_map, child_map, tails_to_remove): """Remove tails from the parent map. This will remove the supplied revisions until no more children have 0 parents. :param parent_map: A dict of {child: [parents]}, this dictionary will be modified in place. :param tails_to_remove: A list of tips that should be removed, this list will be consumed :param child_map: The reverse dict of parent_map ({parent: [children]}) this dict will be modified :return: None, parent_map will be modified in place. """ while tails_to_remove: next = tails_to_remove.pop() parent_map.pop(next) children = child_map.pop(next) for child in children: child_parents = parent_map[child] child_parents.remove(next) if len(child_parents) == 0: tails_to_remove.append(child) def _get_interesting_texts(self, parent_map): """Return a dict of texts we are interested in. Note that the input is in key tuples, but the output is in plain revision ids. 
:param parent_map: The output from _find_recursive_lcas :return: A dict of {'revision_id':lines} as returned by _PlanMergeBase.get_lines() """ all_revision_keys = set(parent_map) all_revision_keys.add(self.a_key) all_revision_keys.add(self.b_key) # Everything else is in 'keys' but get_lines is in 'revision_ids' all_texts = self.get_lines([k[-1] for k in all_revision_keys]) return all_texts def _build_weave(self): self._weave = weave.Weave(weave_name="in_memory_weave", allow_reserved=True) parent_map = self._find_recursive_lcas() all_texts = self._get_interesting_texts(parent_map) # Note: Unfortunately, the order given by topo_sort will affect the # ordering resolution in the output. Specifically, if you add A then B, # then in the output text A lines will show up before B lines. And, of # course, topo_sort doesn't guarantee any real ordering. # So we use merge_sort, and add a fake node on the tip. # This ensures that left-hand parents will always be inserted into the # weave before right-hand parents. tip_key = self._key_prefix + (_mod_revision.CURRENT_REVISION,) parent_map[tip_key] = (self.a_key, self.b_key) for _seq_num, key, _depth, _eom in reversed(merge_sort(parent_map, tip_key)): if key == tip_key: continue # for key in tsort.topo_sort(parent_map): parent_keys = parent_map[key] revision_id = key[-1] parent_ids = [k[-1] for k in parent_keys] self._weave.add_lines(revision_id, parent_ids, all_texts[revision_id]) def plan_merge(self): """Generate a 'plan' for merging the two revisions. This involves comparing their texts and determining the cause of differences. If text A has a line and text B does not, then either the line was added to text A, or it was deleted from B. 
Once the causes are combined, they are written out in the format described in VersionedFile.plan_merge """ if self._head_key is not None: # There was a single head if self._head_key == self.a_key: plan = "new-a" else: if self._head_key != self.b_key: raise AssertionError( f"There was an invalid head: {self.b_key} != {self._head_key}" ) plan = "new-b" head_rev = self._head_key[-1] lines = self.get_lines([head_rev])[head_rev] return ((plan, line) for line in lines) return self._weave.plan_merge(self.a_rev, self.b_rev) class _PlanLCAMerge(_PlanMergeBase): """Merger that uses LCA. This merge algorithm differs from _PlanMerge in that: 1. comparisons are done against LCAs only 2. cases where a contested line is new versus one LCA but old versus another are marked as conflicts, by emitting the line as conflicted-a or conflicted-b. This is faster, and hopefully produces more useful output. """ def __init__(self, a_rev, b_rev, vf, key_prefix, graph): _PlanMergeBase.__init__(self, a_rev, b_rev, vf, key_prefix) lcas = graph.find_lca(key_prefix + (a_rev,), key_prefix + (b_rev,)) self.lcas = set() for lca in lcas: if lca == _mod_revision.NULL_REVISION: self.lcas.add(lca) else: self.lcas.add(lca[-1]) for lca in self.lcas: lca_lines = [] if _mod_revision.is_null(lca) else self.get_lines([lca])[lca] matcher = patiencediff.PatienceSequenceMatcher( None, self.lines_a, lca_lines ) blocks = list(matcher.get_matching_blocks()) self._cached_matching_blocks[(a_rev, lca)] = blocks matcher = patiencediff.PatienceSequenceMatcher( None, self.lines_b, lca_lines ) blocks = list(matcher.get_matching_blocks()) self._cached_matching_blocks[(b_rev, lca)] = blocks def _determine_status(self, revision_id, unique_line_numbers): """Determines the status unique lines versus all lcas. Basically, determines why the line is unique to this revision. A line may be determined new, killed, or both. 
If a line is determined new, that means it was not present in at least one LCA, and is not present in the other merge revision. If a line is determined killed, that means the line was present in at least one LCA. If a line is killed and new, this indicates that the two merge revisions contain differing conflict resolutions. :param revision_id: The id of the revision in which the lines are unique :param unique_line_numbers: The line numbers of unique lines. :return: a tuple of (new_this, killed_other) """ new = set() killed = set() unique_line_numbers = set(unique_line_numbers) for lca in self.lcas: blocks = self._get_matching_blocks(revision_id, lca) unique_vs_lca, _ignored = self._unique_lines(blocks) new.update(unique_line_numbers.intersection(unique_vs_lca)) killed.update(unique_line_numbers.difference(unique_vs_lca)) return new, killed bzrformats_3.4.0.orig/bzrformats/multiparent.py0000644000000000000000000006400315162074037016775 0ustar00# Copyright (C) 2007-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Multi-parent diff implementation for versioned files.""" import contextlib import os from io import BytesIO from . 
import errors def topo_iter_keys(vf, keys=None): """Iterate through keys in topological order.""" if keys is None: keys = vf.keys() parents = vf.get_parent_map(keys) return _topo_iter(parents, keys) def topo_iter(vf, versions=None): """Iterate through versions in topological order.""" if versions is None: versions = vf.versions() parents = vf.get_parent_map(versions) return _topo_iter(parents, versions) def _topo_iter(parents, versions): seen = set() descendants = {} def pending_parents(version): if parents[version] is None: return [] return [v for v in parents[version] if v in versions and v not in seen] for version_id in versions: if parents[version_id] is None: # parentless continue for parent_id in parents[version_id]: descendants.setdefault(parent_id, []).append(version_id) cur = [v for v in versions if len(pending_parents(v)) == 0] while len(cur) > 0: next = [] for version_id in cur: if version_id in seen: continue if len(pending_parents(version_id)) != 0: continue next.extend(descendants.get(version_id, [])) yield version_id seen.add(version_id) cur = next class MultiParent: """A multi-parent diff.""" __slots__ = ["hunks"] def __init__(self, hunks=None): """Initialize a MultiParent diff.""" if hunks is not None: self.hunks = hunks else: self.hunks = [] def __repr__(self): """Return a string representation of this MultiParent.""" return f"MultiParent({self.hunks!r})" def __eq__(self, other): """Check equality with another MultiParent.""" if self.__class__ is not other.__class__: return False return self.hunks == other.hunks @staticmethod def from_lines(text, parents=(), left_blocks=None): """Produce a MultiParent from a list of lines and parents.""" try: import patiencediff except ImportError as e: raise ImportError( "patiencediff module is required for multiparent operations" ) from e def compare(parent): matcher = patiencediff.PatienceSequenceMatcher(None, parent, text) return matcher.get_matching_blocks() if len(parents) > 0: if left_blocks is None: 
left_blocks = compare(parents[0]) parent_comparisons = [left_blocks] + [compare(p) for p in parents[1:]] else: parent_comparisons = [] cur_line = 0 new_text = NewText([]) block_iter = [iter(i) for i in parent_comparisons] diff = MultiParent([]) def next_block(p): try: return next(block_iter[p]) except StopIteration: return None cur_block = [next_block(p) for p, i in enumerate(block_iter)] while cur_line < len(text): best_match = None for p, block in enumerate(cur_block): if block is None: continue i, j, n = block while j + n <= cur_line: block = cur_block[p] = next_block(p) if block is None: break i, j, n = block if block is None: continue if j > cur_line: continue offset = cur_line - j i += offset j = cur_line n -= offset if n == 0: continue if best_match is None or n > best_match.num_lines: best_match = ParentText(p, i, j, n) if best_match is None: new_text.lines.append(text[cur_line]) cur_line += 1 else: if len(new_text.lines) > 0: diff.hunks.append(new_text) new_text = NewText([]) diff.hunks.append(best_match) cur_line += best_match.num_lines if len(new_text.lines) > 0: diff.hunks.append(new_text) return diff def get_matching_blocks(self, parent, parent_len): """Get matching blocks for a specific parent.""" for hunk in self.hunks: if not isinstance(hunk, ParentText) or hunk.parent != parent: continue yield (hunk.parent_pos, hunk.child_pos, hunk.num_lines) yield parent_len, self.num_lines(), 0 def to_lines(self, parents=()): """Construct a fulltext from this diff and its parents.""" mpvf = MultiMemoryVersionedFile() for num, parent in enumerate(parents): mpvf.add_version(BytesIO(parent).readlines(), num, []) mpvf.add_diff(self, "a", list(range(len(parents)))) return mpvf.get_line_list(["a"])[0] @classmethod def from_texts(cls, text, parents=()): """Produce a MultiParent from a text and list of parent text.""" return cls.from_lines( BytesIO(text).readlines(), [BytesIO(p).readlines() for p in parents] ) def to_patch(self): """Yield text lines for a patch.""" for 
hunk in self.hunks: yield from hunk.to_patch() def patch_len(self): """Return the length of the patch.""" return len(b"".join(self.to_patch())) def zipped_patch_len(self): """Return the length of the gzipped patch.""" return len(gzip_string(self.to_patch())) @classmethod def from_patch(cls, text): """Create a MultiParent from its string form.""" return cls._from_patch(BytesIO(text)) @staticmethod def _from_patch(lines): r"""This is private because it is essential to split lines on \n only.""" line_iter = iter(lines) hunks = [] cur_line = None while True: try: cur_line = next(line_iter) except StopIteration: break first_char = cur_line[0:1] if first_char == b"i": num_lines = int(cur_line.split(b" ")[1]) hunk_lines = [next(line_iter) for _ in range(num_lines)] hunk_lines[-1] = hunk_lines[-1][:-1] hunks.append(NewText(hunk_lines)) elif first_char == b"\n": hunks[-1].lines[-1] += b"\n" else: if not (first_char == b"c"): raise AssertionError(first_char) parent, parent_pos, child_pos, num_lines = ( int(v) for v in cur_line.split(b" ")[1:] ) hunks.append(ParentText(parent, parent_pos, child_pos, num_lines)) return MultiParent(hunks) def range_iterator(self): """Iterate through the hunks, with range indicated. kind is "new" or "parent". for "new", data is a list of lines. 
for "parent", data is (parent, parent_start, parent_end) :return: a generator of (start, end, kind, data) """ start = 0 for hunk in self.hunks: if isinstance(hunk, NewText): kind = "new" end = start + len(hunk.lines) data = hunk.lines else: kind = "parent" start = hunk.child_pos end = start + hunk.num_lines data = (hunk.parent, hunk.parent_pos, hunk.parent_pos + hunk.num_lines) yield start, end, kind, data start = end def num_lines(self): """The number of lines in the output text.""" extra_n = 0 for hunk in reversed(self.hunks): if isinstance(hunk, ParentText): return hunk.child_pos + hunk.num_lines + extra_n extra_n += len(hunk.lines) return extra_n def is_snapshot(self): """Return true if this diff is effectively a fulltext.""" if len(self.hunks) != 1: return False return isinstance(self.hunks[0], NewText) class NewText: """The contents of text that is introduced by this text.""" __slots__ = ["lines"] def __init__(self, lines): """Initialize a NewText hunk.""" self.lines = lines def __eq__(self, other): """Check equality with another NewText.""" if self.__class__ is not other.__class__: return False return other.lines == self.lines def __repr__(self): """Return a string representation of this NewText.""" return f"NewText({self.lines!r})" def to_patch(self): """Generate patch lines for this NewText.""" yield b"i %d\n" % len(self.lines) yield from self.lines yield b"\n" class ParentText: """A reference to text present in a parent text.""" __slots__ = ["child_pos", "num_lines", "parent", "parent_pos"] def __init__(self, parent, parent_pos, child_pos, num_lines): """Initialize a ParentText hunk.""" self.parent = parent self.parent_pos = parent_pos self.child_pos = child_pos self.num_lines = num_lines def _as_dict(self): return { b"parent": self.parent, b"parent_pos": self.parent_pos, b"child_pos": self.child_pos, b"num_lines": self.num_lines, } def __repr__(self): """Return a string representation of this ParentText.""" return ( "ParentText({parent!r}, 
{parent_pos!r}, {child_pos!r}," " {num_lines!r})".format(**self._as_dict()) ) def __eq__(self, other): """Check equality with another ParentText.""" if self.__class__ is not other.__class__: return False return self._as_dict() == other._as_dict() def to_patch(self): """Generate patch lines for this ParentText.""" yield ( b"c %(parent)d %(parent_pos)d %(child_pos)d %(num_lines)d\n" % self._as_dict() ) class BaseVersionedFile: """Pseudo-VersionedFile skeleton for MultiParent.""" def __init__(self, snapshot_interval=25, max_snapshots=None): """Initialize a BaseVersionedFile.""" self._lines = {} self._parents = {} self._snapshots = set() self.snapshot_interval = snapshot_interval self.max_snapshots = max_snapshots def versions(self): """Return an iterator of version IDs.""" return iter(self._parents) def has_version(self, version): """Check if a version exists.""" return version in self._parents def do_snapshot(self, version_id, parent_ids): """Determine whether to perform a snapshot for this version.""" if self.snapshot_interval is None: return False if ( self.max_snapshots is not None and len(self._snapshots) == self.max_snapshots ): return False if len(parent_ids) == 0: return True for _ignored in range(self.snapshot_interval): if len(parent_ids) == 0: return False version_ids = parent_ids parent_ids = [] for version_id in version_ids: if version_id not in self._snapshots: parent_ids.extend(self._parents[version_id]) else: return True def add_version( self, lines, version_id, parent_ids, force_snapshot=None, single_parent=False ): r"""Add a version to the versionedfile. :param lines: The list of lines to add. Must be split on '\n'. :param version_id: The version_id of the version to add :param force_snapshot: If true, force this version to be added as a snapshot version. If false, force this version to be added as a diff. If none, determine this automatically. :param single_parent: If true, use a single parent, rather than multiple parents. 
""" if force_snapshot is None: do_snapshot = self.do_snapshot(version_id, parent_ids) else: do_snapshot = force_snapshot if do_snapshot: self._snapshots.add(version_id) diff = MultiParent([NewText(lines)]) else: if single_parent: parent_lines = self.get_line_list(parent_ids[:1]) else: parent_lines = self.get_line_list(parent_ids) diff = MultiParent.from_lines(lines, parent_lines) if diff.is_snapshot(): self._snapshots.add(version_id) self.add_diff(diff, version_id, parent_ids) self._lines[version_id] = lines def get_parents(self, version_id): """Get the parent IDs for a version.""" return self._parents[version_id] def make_snapshot(self, version_id): """Create a snapshot for the given version.""" snapdiff = MultiParent([NewText(self.cache_version(version_id))]) self.add_diff(snapdiff, version_id, self._parents[version_id]) self._snapshots.add(version_id) def import_versionedfile( self, vf, snapshots, no_cache=True, single_parent=False, verify=False, progress_callback=None, ): """Import all revisions of a versionedfile. :param vf: The versionedfile to import :param snapshots: If provided, the revisions to make snapshots of. Otherwise, this will be auto-determined :param no_cache: If true, clear the cache after every add. :param single_parent: If true, omit all but one parent text, (but retain parent metadata). :param progress_callback: Optional callback function that will be called with (current, total) to report progress. 
""" if not (no_cache or not verify): raise ValueError() revisions = set(vf.versions()) total = len(revisions) processed = 0 while len(revisions) > 0: added = set() for revision in revisions: parents = vf.get_parents(revision) if [p for p in parents if p not in self._parents] != []: continue lines = [a + b" " + l for a, l in vf.annotate(revision)] if snapshots is None: force_snapshot = None else: force_snapshot = revision in snapshots self.add_version( lines, revision, parents, force_snapshot, single_parent ) added.add(revision) if no_cache: self.clear_cache() vf.clear_cache() if verify: if not (lines == self.get_line_list([revision])[0]): raise AssertionError() self.clear_cache() processed += len(added) if progress_callback: progress_callback(processed, total) revisions = [r for r in revisions if r not in added] def select_snapshots(self, vf): """Determine which versions to add as snapshots.""" build_ancestors = {} snapshots = set() for version_id in topo_iter(vf): potential_build_ancestors = set(vf.get_parents(version_id)) parents = vf.get_parents(version_id) if len(parents) == 0: snapshots.add(version_id) build_ancestors[version_id] = set() else: for parent in vf.get_parents(version_id): potential_build_ancestors.update(build_ancestors[parent]) if len(potential_build_ancestors) > self.snapshot_interval: snapshots.add(version_id) build_ancestors[version_id] = set() else: build_ancestors[version_id] = potential_build_ancestors return snapshots def select_by_size(self, num): """Select snapshots for minimum output size.""" num -= len(self._snapshots) new_snapshots = self.get_size_ranking()[-num:] return [v for n, v in new_snapshots] def get_size_ranking(self): """Get versions ranked by size.""" versions = [] for version_id in self.versions(): if version_id in self._snapshots: continue diff_len = self.get_diff(version_id).patch_len() snapshot_len = MultiParent( [NewText(self.cache_version(version_id))] ).patch_len() versions.append((snapshot_len - diff_len, 
version_id)) versions.sort() return versions def import_diffs(self, vf): """Import the diffs from another pseudo-versionedfile.""" for version_id in vf.versions(): self.add_diff(vf.get_diff(version_id), version_id, vf._parents[version_id]) def get_build_ranking(self): """Return revisions sorted by how much they reduce build complexity.""" could_avoid = {} referenced_by = {} for version_id in topo_iter(self): could_avoid[version_id] = set() if version_id not in self._snapshots: for parent_id in self._parents[version_id]: could_avoid[version_id].update(could_avoid[parent_id]) could_avoid[version_id].update(self._parents) could_avoid[version_id].discard(version_id) for avoid_id in could_avoid[version_id]: referenced_by.setdefault(avoid_id, set()).add(version_id) available_versions = list(self.versions()) ranking = [] while len(available_versions) > 0: available_versions.sort( key=lambda x: len(could_avoid[x]) * len(referenced_by.get(x, [])) ) selected = available_versions.pop() ranking.append(selected) for version_id in referenced_by[selected]: could_avoid[version_id].difference_update(could_avoid[selected]) for version_id in could_avoid[selected]: referenced_by[version_id].difference_update(referenced_by[selected]) return ranking def clear_cache(self): """Clear the cached lines.""" self._lines.clear() def get_line_list(self, version_ids): """Get a list of line lists for the given version IDs.""" return [self.cache_version(v) for v in version_ids] def cache_version(self, version_id): """Get the lines for a version, caching if necessary.""" try: return self._lines[version_id] except KeyError: pass self.get_diff(version_id) lines = [] reconstructor = _Reconstructor(self, self._lines, self._parents) reconstructor.reconstruct_version(lines, version_id) self._lines[version_id] = lines return lines class MultiMemoryVersionedFile(BaseVersionedFile): """Memory-backed pseudo-versionedfile.""" def __init__(self, snapshot_interval=25, max_snapshots=None): """Initialize a 
MultiMemoryVersionedFile.""" BaseVersionedFile.__init__(self, snapshot_interval, max_snapshots) self._diffs = {} def add_diff(self, diff, version_id, parent_ids): """Add a diff to the versioned file.""" self._diffs[version_id] = diff self._parents[version_id] = parent_ids def get_diff(self, version_id): """Get the diff for a version.""" try: return self._diffs[version_id] except KeyError as e: raise errors.RevisionNotPresent(version_id, self) from e def destroy(self): """Clear all diffs.""" self._diffs = {} class MultiVersionedFile(BaseVersionedFile): """Disk-backed pseudo-versionedfile.""" def __init__(self, filename, snapshot_interval=25, max_snapshots=None): """Initialize a MultiVersionedFile.""" BaseVersionedFile.__init__(self, snapshot_interval, max_snapshots) self._filename = filename self._diff_offset = {} def get_diff(self, version_id): """Get the diff for a version from disk.""" import gzip start, count = self._diff_offset[version_id] with open(self._filename + ".mpknit", "rb") as infile: infile.seek(start) sio = BytesIO(infile.read(count)) with gzip.GzipFile(None, mode="rb", fileobj=sio) as zip_file: zip_file.readline() content = zip_file.read() return MultiParent.from_patch(content) def add_diff(self, diff, version_id, parent_ids): """Add a diff to the versioned file on disk.""" import gzip import itertools with open(self._filename + ".mpknit", "ab") as outfile: outfile.seek(0, 2) # workaround for windows bug: # .tell() for files opened in 'ab' mode # before any write returns 0 start = outfile.tell() with gzip.GzipFile(None, mode="ab", fileobj=outfile) as zipfile: zipfile.writelines( itertools.chain([b"version %s\n" % version_id], diff.to_patch()) ) end = outfile.tell() self._diff_offset[version_id] = (start, end - start) self._parents[version_id] = parent_ids def destroy(self): """Remove the files from disk.""" with contextlib.suppress(FileNotFoundError): os.unlink(self._filename + ".mpknit") with contextlib.suppress(FileNotFoundError): 
os.unlink(self._filename + ".mpidx") def save(self): """Save the index to disk.""" import fastbencode as bencode with open(self._filename + ".mpidx", "wb") as f: f.write( bencode.bencode( (self._parents, list(self._snapshots), self._diff_offset) ) ) def load(self): """Load the index from disk.""" import fastbencode as bencode with open(self._filename + ".mpidx", "rb") as f: self._parents, snapshots, self._diff_offset = bencode.bdecode(f.read()) self._snapshots = set(snapshots) class _Reconstructor: """Build a text from the diffs, ancestry graph and cached lines.""" def __init__(self, diffs, lines, parents): self.diffs = diffs self.lines = lines self.parents = parents self.cursor = {} def reconstruct(self, lines, parent_text, version_id): """Append the lines referred to by a ParentText to lines.""" parent_id = self.parents[version_id][parent_text.parent] end = parent_text.parent_pos + parent_text.num_lines return self._reconstruct(lines, parent_id, parent_text.parent_pos, end) def _reconstruct(self, lines, req_version_id, req_start, req_end): """Append lines for the requested version_id range.""" # stack of pending range requests if req_start == req_end: return pending_reqs = [(req_version_id, req_start, req_end)] while len(pending_reqs) > 0: req_version_id, req_start, req_end = pending_reqs.pop() # lazily allocate cursors for versions if req_version_id in self.lines: lines.extend(self.lines[req_version_id][req_start:req_end]) continue try: start, end, kind, data, iterator = self.cursor[req_version_id] except KeyError: iterator = self.diffs.get_diff(req_version_id).range_iterator() start, end, kind, data = next(iterator) if start > req_start: iterator = self.diffs.get_diff(req_version_id).range_iterator() start, end, kind, data = next(iterator) # find the first hunk relevant to the request while end <= req_start: start, end, kind, data = next(iterator) self.cursor[req_version_id] = start, end, kind, data, iterator # if the hunk can't satisfy the whole request, split 
it in two, # and leave the second half for later. if req_end > end: pending_reqs.append((req_version_id, end, req_end)) req_end = end if kind == "new": lines.extend(data[req_start - start : (req_end - start)]) else: # If the hunk is a ParentText, rewrite it as a range request # for the parent, and make it the next pending request. parent, parent_start, parent_end = data new_version_id = self.parents[req_version_id][parent] new_start = parent_start + req_start - start new_end = parent_end + req_end - end pending_reqs.append((new_version_id, new_start, new_end)) def reconstruct_version(self, lines, version_id): length = self.diffs.get_diff(version_id).num_lines() return self._reconstruct(lines, version_id, 0, length) def gzip_string(lines): """Compress lines using gzip.""" import gzip sio = BytesIO() with gzip.GzipFile(None, mode="wb", fileobj=sio) as data_file: data_file.writelines(lines) return sio.getvalue() bzrformats_3.4.0.orig/bzrformats/osutils.py0000644000000000000000000003765215162203117016136 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """OS utilities for bzrformats using only standard library.""" import hashlib import logging import os import shutil import sys import unicodedata def split(path): """Split a pathname into directory and basename parts.""" if isinstance(path, bytes): return os.path.split(path) else: # For unicode strings, encode to UTF-8, split, then decode encoded = path.encode("utf-8") dirname, basename = os.path.split(encoded) return dirname.decode("utf-8"), basename.decode("utf-8") def pathjoin(*args): """Join paths together.""" if not args: return b"" if isinstance(args[0], bytes) else "" # Check if we're dealing with bytes or strings if isinstance(args[0], bytes): return os.path.join(*args) else: # For unicode strings, encode to UTF-8, join, then decode encoded_args = [arg.encode("utf-8") for arg in args] result = os.path.join(*encoded_args) return result.decode("utf-8") def pumpfile(from_file, to_file, buffer_size=65536): """Copy data from one file-like object to another. Returns the number of bytes copied. """ initial_pos = to_file.tell() if hasattr(to_file, "tell") else 0 shutil.copyfileobj(from_file, to_file, buffer_size) if hasattr(to_file, "tell"): return to_file.tell() - initial_pos else: # If we can't tell the position, we can't return accurate byte count return 0 def chunks_to_lines(chunks): """Convert chunks to lines.""" if not chunks: return [] # Join all chunks data = b"".join(chunks) # Split into lines, keeping line endings lines = [] start = 0 for i, byte in enumerate(data): if byte == ord(b"\n"): lines.append(data[start : i + 1]) start = i + 1 # Add remaining data if any if start < len(data): lines.append(data[start:]) return lines def normalized_filename(filename): """Return the normalized form of a filename. Returns (normalized_name, can_access) tuple. 
""" if isinstance(filename, bytes): # For bytes, try to decode as UTF-8 first try: unicode_filename = filename.decode("utf-8") except UnicodeDecodeError: # If it's not valid UTF-8, return as-is return filename, True else: unicode_filename = filename # Normalize using NFC (Canonical Decomposition, followed by Canonical Composition) normalized = unicodedata.normalize("NFC", unicode_filename) if isinstance(filename, bytes): try: return normalized.encode("utf-8"), True except UnicodeEncodeError: return filename, True else: return normalized, True def failed_to_load_extension(exception): """Log a message about a failed extension load.""" logging.debug("Failed to load extension: %s", exception) def fdatasync(fileno): """Flush file contents to disk, not metadata.""" try: os.fdatasync(fileno) except AttributeError: # fdatasync is not available on all platforms (e.g., Windows) # Fall back to fsync which is more widely available os.fsync(fileno) def splitpath(path): """Split a path into a list of components.""" if isinstance(path, bytes): if path.startswith(b"/"): path = path[1:] if not path: return [] return path.split(b"/") else: if path.startswith("/"): path = path[1:] if not path: return [] return path.split("/") def file_kind_from_stat_mode(mode): """Return the file kind based on the stat mode.""" import stat if stat.S_ISREG(mode): return "file" elif stat.S_ISDIR(mode): return "directory" elif stat.S_ISLNK(mode): return "symlink" elif stat.S_ISFIFO(mode): return "fifo" elif stat.S_ISSOCK(mode): return "socket" elif stat.S_ISCHR(mode): return "chardev" elif stat.S_ISBLK(mode): return "block" else: return "unknown" def contains_whitespace(s): """Return True if the string contains whitespace characters.""" # Check for common whitespace characters if isinstance(s, bytes): return any(c in s for c in b" \t\n\r\v\f") else: return any(c in s for c in " \t\n\r\v\f") def sha_strings(strings): """Return the sha1 of concatenated strings.""" sha = hashlib.sha1() # noqa: S324 for 
string in strings: if isinstance(string, str): # Convert unicode strings to bytes using UTF-8 string = string.encode("utf-8") sha.update(string) return sha.hexdigest().encode("ascii") def sha_string(string): """Return the sha1 of a single string.""" if isinstance(string, str): # Convert unicode strings to bytes using UTF-8 string = string.encode("utf-8") sha = hashlib.sha1() # noqa: S324 sha.update(string) return sha.hexdigest().encode("ascii") def sha_file(file_obj): """Return the sha1 of a file.""" sha = hashlib.sha1() # noqa: S324 while True: chunk = file_obj.read(65536) if not chunk: break sha.update(chunk) return sha.hexdigest().encode("ascii") def dirname(path): """Return the directory part of a path.""" if isinstance(path, bytes): return os.path.dirname(path) else: # For unicode strings, encode to UTF-8, get dirname, then decode encoded = path.encode("utf-8") result = os.path.dirname(encoded) return result.decode("utf-8") def basename(path): """Return the basename part of a path.""" if isinstance(path, bytes): return os.path.basename(path) else: # For unicode strings, encode to UTF-8, get basename, then decode encoded = path.encode("utf-8") result = os.path.basename(encoded) return result.decode("utf-8") def chunks_to_lines_iter(chunks_iter): """Convert an iterator of chunks to an iterator of lines.""" buffer = b"" for chunk in chunks_iter: buffer += chunk while b"\n" in buffer: line, buffer = buffer.split(b"\n", 1) yield line + b"\n" # Yield any remaining data as the last line (without newline) if buffer: yield buffer def file_iterator(file_obj, chunk_size=65536): """Iterate over the contents of a file in chunks.""" while True: chunk = file_obj.read(chunk_size) if not chunk: break yield chunk def normalizes_filenames(): """Check if the filesystem normalizes filenames (e.g. Mac OS X).""" from . import _osutils_rs return _osutils_rs.normalizes_filenames() def rand_chars(length): """Generate a string of random characters.""" from . 
import _osutils_rs return _osutils_rs.rand_chars(length) class DirReader: """An interface for reading directories.""" def top_prefix_to_starting_dir(self, top, prefix=""): """Converts top and prefix to a starting dir entry. :param top: A utf8 path :param prefix: An optional utf8 path to prefix output relative paths with. :return: A tuple starting with prefix, and ending with the native encoding of top. """ raise NotImplementedError(self.top_prefix_to_starting_dir) def read_dir(self, prefix, top): """Read a specific dir. :param prefix: A utf8 prefix to be prepended to the path basenames. :param top: A natively encoded path to read. :return: A list of the directory's contents. Each item contains: (utf8_relpath, utf8_name, kind, lstatvalue, native_abspath) """ raise NotImplementedError(self.read_dir) _selected_dir_reader = None def safe_unicode(unicode_or_utf8_string): """Coerce unicode_or_utf8_string into unicode. If it is unicode, it is returned. Otherwise it is decoded from utf-8. If decoding fails, the exception is wrapped in a TypeError exception. """ if isinstance(unicode_or_utf8_string, (str, os.PathLike)): return unicode_or_utf8_string try: return unicode_or_utf8_string.decode("utf8") except UnicodeDecodeError as e: raise TypeError(unicode_or_utf8_string) from e def safe_utf8(unicode_or_utf8_string): """Coerce unicode_or_utf8_string to a utf8 string. If it is a bytes object, it is checked for valid utf-8 and returned unchanged; a TypeError is raised if it is not valid utf-8. If it is a str, it is encoded into a utf-8 byte string. """ if isinstance(unicode_or_utf8_string, bytes): # Make sure it is a valid utf-8 string try: unicode_or_utf8_string.decode("utf-8") except UnicodeDecodeError as e: raise TypeError(unicode_or_utf8_string) from e return unicode_or_utf8_string return unicode_or_utf8_string.encode("utf-8") def _walkdirs_utf8(top, prefix="", fs_enc=None): """Yield data about all the directories in a tree. This yields the same information as walkdirs() only each entry is yielded in utf-8. 
On platforms which have a filesystem encoding of utf8 the paths are returned as exact byte-strings. :return: yields a tuple of (dir_info, [file_info]) dir_info is (utf8_relpath, path-from-top) file_info is (utf8_relpath, utf8_name, kind, lstat, path-from-top) if top is an absolute path, path-from-top is also an absolute path. path-from-top might be unicode or utf8, but it is the correct path to pass to os functions to affect the file in question. (such as os.lstat) """ global _selected_dir_reader if _selected_dir_reader is None: if fs_enc is None: fs_enc = sys.getfilesystemencoding() # Always use the python version for bzrformats _selected_dir_reader = UnicodeDirReader() # 0 - relpath, 1- basename, 2- kind, 3- stat, 4-toppath # But we don't actually use 1-3 in pending, so set them to None pending = [[_selected_dir_reader.top_prefix_to_starting_dir(top, prefix)]] read_dir = _selected_dir_reader.read_dir _directory = "directory" while pending: relroot, _, _, _, top = pending[-1].pop() if not pending[-1]: pending.pop() dirblock = sorted(read_dir(relroot, top)) yield (relroot, top), dirblock # push the user specified dirs from dirblock next = [d for d in reversed(dirblock) if d[2] == _directory] if next: pending.append(next) class UnicodeDirReader(DirReader): """A dir reader for non-utf8 file systems, which transcodes.""" __slots__ = ["_utf8_encode"] def __init__(self): """Initialize the UTF-8 directory reader.""" import codecs self._utf8_encode = codecs.getencoder("utf8") def top_prefix_to_starting_dir(self, top, prefix=""): """See DirReader.top_prefix_to_starting_dir.""" return (safe_utf8(prefix), None, None, None, safe_unicode(top)) def read_dir(self, prefix, top): """Read a single directory from a non-utf8 file system. top, and the abspath element in the output are unicode, all other paths are utf8. Local disk IO is done via unicode calls to listdir etc. This is currently the fallback code path when the filesystem encoding is not UTF-8. 
It may be better to implement an alternative so that we can safely handle paths that are not properly decodable in the current encoding. See DirReader.read_dir for details. """ _utf8_encode = self._utf8_encode relprefix = prefix + b"/" if prefix else b"" top_slash = top + "/" dirblock = [] append = dirblock.append for entry in os.scandir(top): name = os.fsdecode(entry.name) abspath = top_slash + name name_utf8 = _utf8_encode(name, "surrogateescape")[0] statvalue = entry.stat(follow_symlinks=False) kind = file_kind_from_stat_mode(statvalue.st_mode) append((relprefix + name_utf8, name_utf8, kind, statvalue, abspath)) return sorted(dirblock) def is_inside(dir, fname): """Check if fname is inside dir. The empty string as dir is considered to contain everything. A path is considered to be inside itself. :param dir: Directory path (bytes or str) :param fname: File path to check (bytes or str) :return: True if fname is inside dir """ # Normalize to use bytes for comparison if isinstance(dir, str): dir = dir.encode("utf-8") if isinstance(fname, str): fname = fname.encode("utf-8") if dir == fname: return True # Ensure trailing slash for proper comparison if dir != b"": dir = dir.rstrip(b"/") + b"/" return fname.startswith(dir) def is_inside_any(dir_list, fname): """Check if fname is inside any of the directories in dir_list. :param dir_list: List of directory paths :param fname: File path to check :return: True if fname is inside any directory in dir_list """ return any(is_inside(dir, fname) for dir in dir_list) def parent_directories(filename): """Return a list of parent directories of filename. :param filename: Path (bytes or str) :return: List of parent directory paths """ from . import _osutils_rs if isinstance(filename, bytes): filename = filename.decode("utf-8") return _osutils_rs.parent_directories(filename) def split_lines(text): r"""Split text into lines, keeping line endings. 
Args: text: bytes to split Returns: List of byte strings, each ending with \\n where appropriate """ from . import _osutils_rs return _osutils_rs.split_lines(text) class IterableFile: """A file-like object backed by an iterator of byte strings. Supports ``read()`` and ``readline()`` over a lazy sequence of chunks. """ def __init__(self, iterable): """Initialize with an iterable of byte chunks.""" self._iter = iter(iterable) self._buf = b"" def read(self, size=-1): """Read up to *size* bytes, or all remaining if *size* < 0.""" if size < 0: return self._buf + b"".join(self._iter) while len(self._buf) < size: try: self._buf += next(self._iter) except StopIteration: break result = self._buf[:size] self._buf = self._buf[size:] return result def readline(self): r"""Read one line (up to and including ``\\n``).""" while b"\n" not in self._buf: try: self._buf += next(self._iter) except StopIteration: # Return whatever is left result = self._buf self._buf = b"" return result idx = self._buf.index(b"\n") + 1 result = self._buf[:idx] self._buf = self._buf[idx:] return result def readlines(self): """Return all remaining lines as a list.""" lines = [] while True: line = self.readline() if not line: break lines.append(line) return lines def __iter__(self): """Iterate over lines.""" while True: line = self.readline() if not line: break yield line bzrformats_3.4.0.orig/bzrformats/pack.py0000644000000000000000000005406015162073400015342 0ustar00# Copyright (C) 2007, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Container format for Bazaar data. "Containers" and "records" are described in doc/developers/container-format.txt. """ import re from io import BytesIO from . import errors FORMAT_ONE = b"Bazaar pack format 1 (introduced in 0.18)" _whitespace_re = re.compile(b"[\t\n\x0b\x0c\r ]") class ContainerError(errors.BzrFormatsError): """Base class of container errors.""" class UnknownContainerFormatError(ContainerError): """Exception raised when encountering unknown container format.""" _fmt = "Unrecognised container format: %(container_format)r" def __init__(self, container_format): """Initialize UnknownContainerFormatError. Args: container_format: The unknown container format encountered. """ self.container_format = container_format class UnexpectedEndOfContainerError(ContainerError): """Exception raised when container stream ends unexpectedly.""" _fmt = "Unexpected end of container stream" class UnknownRecordTypeError(ContainerError): """Exception raised when encountering unknown record type.""" _fmt = "Unknown record type: %(record_type)r" def __init__(self, record_type): """Initialize UnknownRecordTypeError. Args: record_type: The unknown record type encountered. """ self.record_type = record_type class InvalidRecordError(ContainerError): """Exception raised when a record is invalid.""" _fmt = "Invalid record: %(reason)s" def __init__(self, reason): """Initialize InvalidRecordError. Args: reason: The reason the record is invalid. """ self.reason = reason class ContainerHasExcessDataError(ContainerError): """Exception raised when container has excess data after end marker.""" _fmt = "Container has data after end marker: %(excess)r" def __init__(self, excess): """Initialize ContainerHasExcessDataError. Args: excess: The excess data found after end marker. 
""" self.excess = excess class DuplicateRecordNameError(ContainerError): """Exception raised when container has duplicate record names.""" _fmt = "Container has multiple records with the same name: %(name)s" def __init__(self, name): """Initialize DuplicateRecordNameError. Args: name: The duplicate record name. """ self.name = name.decode("utf-8") def _check_name(name): """Do some basic checking of 'name'. At the moment, this just checks that there are no whitespace characters in a name. :raises InvalidRecordError: if name is not valid. :seealso: _check_name_encoding """ if _whitespace_re.search(name) is not None: raise InvalidRecordError(f"{name!r} is not a valid name.") def _check_name_encoding(name): """Check that 'name' is valid UTF-8. This is separate from _check_name because UTF-8 decoding is relatively expensive, and we usually want to avoid it. :raises InvalidRecordError: if name is not valid UTF-8. """ try: name.decode("utf-8") except UnicodeDecodeError as e: raise InvalidRecordError(str(e)) from e class ContainerSerialiser: """A helper class for serialising containers. It simply returns bytes from method calls to 'begin', 'end' and 'bytes_record'. You may find ContainerWriter to be a more convenient interface. """ def begin(self): """Return the bytes to begin a container.""" return FORMAT_ONE + b"\n" def end(self): """Return the bytes to finish a container.""" return b"E" def bytes_header(self, length, names): """Return the header for a Bytes record.""" # Kind marker byte_sections = [b"B"] # Length byte_sections.append(b"%d\n" % (length,)) # Names for name_tuple in names: # Make sure we're writing valid names. Note that we will leave a # half-written record if a name is bad! for name in name_tuple: _check_name(name) byte_sections.append(b"\x00".join(name_tuple) + b"\n") # End of headers byte_sections.append(b"\n") return b"".join(byte_sections) def bytes_record(self, bytes, names): """Return the bytes for a Bytes record with the given name and contents. 
If the content may be large, construct the header separately and then stream out the contents. """ return self.bytes_header(len(bytes), names) + bytes class ContainerWriter: """A class for writing containers to a file. :attribute records_written: The number of user records added to the container. This does not count the prelude or suffix of the container introduced by the begin() and end() methods. """ # Join up headers with the body if writing fewer than this many bytes: # trades off memory usage and copying to do less IO ops. _JOIN_WRITES_THRESHOLD = 100000 def __init__(self, write_func): """Constructor. :param write_func: a callable that will be called when this ContainerWriter needs to write some bytes. """ self._write_func = write_func self.current_offset = 0 self.records_written = 0 self._serialiser = ContainerSerialiser() def begin(self): """Begin writing a container.""" self.write_func(self._serialiser.begin()) def write_func(self, bytes): """Write bytes to the container. Args: bytes: The bytes to write. """ self._write_func(bytes) self.current_offset += len(bytes) def end(self): """Finish writing a container.""" self.write_func(self._serialiser.end()) def add_bytes_record(self, chunks, length, names): """Add a Bytes record with the given names. :param chunks: An iterable of bytestrings to insert. :param length: Total length of the bytes in chunks. :param names: The names to give the inserted bytes. Each name is a tuple of bytestrings. The bytestrings may not contain whitespace. :return: An offset, length tuple. The offset is the offset of the record within the container, and the length is the length of data that will need to be read to reconstitute the record. These offset and length can only be used with the pack interface - they might be offset by headers or other such details and thus are only suitable for use by a ContainerReader.
""" current_offset = self.current_offset if length < self._JOIN_WRITES_THRESHOLD: self.write_func( self._serialiser.bytes_header(length, names) + b"".join(chunks) ) else: self.write_func(self._serialiser.bytes_header(length, names)) for chunk in chunks: self.write_func(chunk) self.records_written += 1 # return a memo of where we wrote data to allow random access. return current_offset, self.current_offset - current_offset class ReadVFile: """Adapt a readv result iterator to a file like protocol. The readv result must support the iterator protocol returning (offset, data_bytes) pairs. """ # XXX: This could be a generic transport class, as other code may want to # gradually consume the readv result. def __init__(self, readv_result): """Construct a new ReadVFile wrapper. :seealso: make_readv_reader :param readv_result: the most recent readv result - list or generator """ # readv can return a sequence or an iterator, but we require an # iterator to know how much has been consumed. readv_result = iter(readv_result) self.readv_result = readv_result self._string = None def _next(self): if self._string is None or self._string.tell() == self._string_length: _offset, data = next(self.readv_result) self._string_length = len(data) self._string = BytesIO(data) def read(self, length): """Read specified number of bytes from the current string. Args: length: Number of bytes to read. Returns: The bytes read. Raises: BzrError: If insufficient bytes are available. """ self._next() result = self._string.read(length) if len(result) < length: raise errors.BzrFormatsError( "wanted %d bytes but next " "hunk only contains %d: %r..." 
% (length, len(result), result[:20]) ) return result def readline(self): """Note that readline will not cross readv segments.""" self._next() result = self._string.readline() if self._string.tell() == self._string_length and result[-1:] != b"\n": raise errors.BzrFormatsError( f"short readline in the readvfile hunk: {result!r}" ) return result def make_readv_reader(transport, filename, requested_records): """Create a ContainerReader that will read selected records only. :param transport: The transport the pack file is located on. :param filename: The filename of the pack file. :param requested_records: The record offset, length tuples as returned by add_bytes_record for the desired records. """ readv_blocks = [(0, len(FORMAT_ONE) + 1)] readv_blocks.extend(requested_records) result = ContainerReader(ReadVFile(transport.readv(filename, readv_blocks))) return result class BaseReader: """Base class for reading container data from files.""" def __init__(self, source_file): """Constructor. :param source_file: a file-like object with `read` and `readline` methods. """ self._source = source_file def reader_func(self, length=None): """Read data from the source file. Args: length: Optional number of bytes to read. If None, reads all available. Returns: The bytes read from the source. """ return self._source.read(length) def _read_line(self): line = self._source.readline() if not line.endswith(b"\n"): raise UnexpectedEndOfContainerError() return line.rstrip(b"\n") class ContainerReader(BaseReader): """A class for reading Bazaar's container format.""" def iter_records(self): """Iterate over the container, yielding each record as it is read. Each yielded record will be a 2-tuple of (names, callable), where names is a ``list`` of name tuples and callable is a function that takes one argument, ``max_length``. You **must not** call the callable after advancing the iterator to the next record.
That is, this code is invalid:: record_iter = container.iter_records() names1, callable1 = next(record_iter) names2, callable2 = next(record_iter) bytes1 = callable1(None) As it will give incorrect results and invalidate the state of the ContainerReader. :raises ContainerError: if any sort of container corruption is detected, e.g. UnknownContainerFormatError if the format of the container is unrecognised. :seealso: BytesRecordReader.read """ self._read_format() return self._iter_records() def iter_record_objects(self): """Iterate over the container, yielding each record as it is read. Each yielded record will be an object with ``read`` and ``validate`` methods. Like with iter_records, it is not safe to use a record object after advancing the iterator to yield the next record. :raises ContainerError: if any sort of container corruption is detected, e.g. UnknownContainerFormatError if the format of the container is unrecognised. :seealso: iter_records """ self._read_format() return self._iter_record_objects() def _iter_records(self): for record in self._iter_record_objects(): yield record.read() def _iter_record_objects(self): while True: try: record_kind = self.reader_func(1) except StopIteration: return if record_kind == b"B": # Bytes record. reader = BytesRecordReader(self._source) yield reader elif record_kind == b"E": # End marker. There are no more records. return elif record_kind == b"": # End of stream encountered, but no End Marker record seen, so # this container is incomplete. raise UnexpectedEndOfContainerError() else: # Unknown record type. raise UnknownRecordTypeError(record_kind) def _read_format(self): format = self._read_line() if format != FORMAT_ONE: raise UnknownContainerFormatError(format) def validate(self): """Validate this container and its records. Validating consumes the data stream just like iter_records and iter_record_objects, so you cannot call it after iter_records/iter_record_objects. :raises ContainerError: if something is invalid.
""" all_names = set() for record_names, read_bytes in self.iter_records(): read_bytes(None) for name_tuple in record_names: for name in name_tuple: _check_name_encoding(name) # Check that the name is unique. Note that Python will refuse # to decode non-shortest forms of UTF-8 encoding, so there is no # risk that the same unicode string has been encoded two # different ways. if name_tuple in all_names: raise DuplicateRecordNameError(name_tuple[0]) all_names.add(name_tuple) excess_bytes = self.reader_func(1) if excess_bytes != b"": raise ContainerHasExcessDataError(excess_bytes) class BytesRecordReader(BaseReader): """Reader for bytes records in container format.""" def read(self): """Read this record. You can either validate or read a record, you can't do both. :returns: A tuple of (names, callable). The callable can be called repeatedly to obtain the bytes for the record, with a max_length argument. If max_length is None, returns all the bytes. Because records can be arbitrarily large, using None is not recommended unless you have reason to believe the content will fit in memory. """ # Read the content length. length_line = self._read_line() try: length = int(length_line) except ValueError as e: raise InvalidRecordError(f"{length_line!r} is not a valid length.") from e # Read the list of names. names = [] while True: name_line = self._read_line() if name_line == b"": break name_tuple = tuple(name_line.split(b"\x00")) for name in name_tuple: _check_name(name) names.append(name_tuple) self._remaining_length = length return names, self._content_reader def _content_reader(self, max_length): if max_length is None: length_to_read = self._remaining_length else: length_to_read = min(max_length, self._remaining_length) self._remaining_length -= length_to_read bytes = self.reader_func(length_to_read) if len(bytes) != length_to_read: raise UnexpectedEndOfContainerError() return bytes def validate(self): """Validate this record. 
You can either validate or read, you can't do both. :raises ContainerError: if this record is invalid. """ names, read_bytes = self.read() for name_tuple in names: for name in name_tuple: _check_name_encoding(name) read_bytes(None) class ContainerPushParser: """A "push" parser for container format 1. It accepts bytes via the ``accept_bytes`` method, and parses them into records which can be retrieved via the ``read_pending_records`` method. """ def __init__(self): """Initialize a new ContainerPushParser.""" self._buffer = b"" self._state_handler = self._state_expecting_format_line self._parsed_records = [] self._reset_current_record() self.finished = False def _reset_current_record(self): self._current_record_length = None self._current_record_names = [] def accept_bytes(self, bytes): """Accept additional bytes and parse them. Args: bytes: New bytes to add to the parsing buffer. """ self._buffer += bytes # Keep iterating the state machine until it stops consuming bytes from # the buffer. last_buffer_length = None cur_buffer_length = len(self._buffer) last_state_handler = None while ( cur_buffer_length != last_buffer_length or last_state_handler != self._state_handler ): last_buffer_length = cur_buffer_length last_state_handler = self._state_handler self._state_handler() cur_buffer_length = len(self._buffer) def read_pending_records(self, max=None): """Read parsed records from the buffer. Args: max: Maximum number of records to return. If None, returns all. Returns: List of parsed records. """ if max: records = self._parsed_records[:max] del self._parsed_records[:max] return records else: records = self._parsed_records self._parsed_records = [] return records def _consume_line(self): """Take a line out of the buffer, and return the line. If a newline byte is not found in the buffer, the buffer is unchanged and this returns None instead. 
""" newline_pos = self._buffer.find(b"\n") if newline_pos != -1: line = self._buffer[:newline_pos] self._buffer = self._buffer[newline_pos + 1 :] return line else: return None def _state_expecting_format_line(self): line = self._consume_line() if line is not None: if line != FORMAT_ONE: raise UnknownContainerFormatError(line) self._state_handler = self._state_expecting_record_type def _state_expecting_record_type(self): if len(self._buffer) >= 1: record_type = self._buffer[:1] self._buffer = self._buffer[1:] if record_type == b"B": self._state_handler = self._state_expecting_length elif record_type == b"E": self.finished = True self._state_handler = self._state_expecting_nothing else: raise UnknownRecordTypeError(record_type) def _state_expecting_length(self): line = self._consume_line() if line is not None: try: self._current_record_length = int(line) except ValueError as e: raise InvalidRecordError(f"{line!r} is not a valid length.") from e self._state_handler = self._state_expecting_name def _state_expecting_name(self): encoded_name_parts = self._consume_line() if encoded_name_parts == b"": self._state_handler = self._state_expecting_body elif encoded_name_parts: name_parts = tuple(encoded_name_parts.split(b"\x00")) for name_part in name_parts: _check_name(name_part) self._current_record_names.append(name_parts) def _state_expecting_body(self): if len(self._buffer) >= self._current_record_length: body_bytes = self._buffer[: self._current_record_length] self._buffer = self._buffer[self._current_record_length :] record = (self._current_record_names, body_bytes) self._parsed_records.append(record) self._reset_current_record() self._state_handler = self._state_expecting_record_type def _state_expecting_nothing(self): pass def read_size_hint(self): """Get a hint for how many bytes should be read next. Returns: Number of bytes that should be read for optimal parsing. 
""" hint = 16384 if self._state_handler == self._state_expecting_body: remaining = self._current_record_length - len(self._buffer) if remaining < 0: remaining = 0 return max(hint, remaining) return hint def iter_records_from_file(source_file): """Iterate over records from a file. Args: source_file: File-like object to read from. Yields: Records from the container file. """ parser = ContainerPushParser() while True: bytes = source_file.read(parser.read_size_hint()) parser.accept_bytes(bytes) yield from parser.read_pending_records() if parser.finished: break bzrformats_3.4.0.orig/bzrformats/pack_repo.py0000644000000000000000000007331315162115107016372 0ustar00# Copyright (C) 2007-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Pack repository objects.""" import hashlib import logging import sys import time logger = logging.getLogger(__name__) from . import btree_index from . import pack as _mod_pack from .errors import BzrCheckError, BzrFormatsError from .transport import TransportNoSuchFile class RetryWithNewPacks(BzrFormatsError): """Raised when we realize that the packs on disk have changed. This is meant as more of a signaling exception, to trap between where a local error occurred and the code that can actually handle the error and code that can retry appropriately. 
""" internal_error = True _fmt = ( "Pack files have changed, reload and retry. context: %(context)s %(orig_error)s" ) def __init__(self, context, reload_occurred, exc_info): """Create a new RetryWithNewPacks error. :param reload_occurred: Set to True if we know that the packs have already been reloaded, and we are failing because of an in-memory cache miss. If set to True then we will ignore if a reload says nothing has changed, because we assume it has already reloaded. If False, then a reload with nothing changed will force an error. :param exc_info: The original exception traceback, so if there is a problem we can raise the original error (value from sys.exc_info()) """ BzrFormatsError.__init__(self) self.context = context self.reload_occurred = reload_occurred self.exc_info = exc_info self.orig_error = exc_info[1] # TODO: The global error handler should probably treat this by # raising/printing the original exception with a bit about # RetryWithNewPacks also not being caught class _DirectPackAccess: """Access to data in one or more packs with less translation.""" def __init__(self, index_to_packs, reload_func=None, flush_func=None): """Create a _DirectPackAccess object. :param index_to_packs: A dict mapping index objects to the transport and file names for obtaining data. :param reload_func: A function to call if we determine that the pack files have moved and we need to reload our caches. See breezy.repo_fmt.pack_repo.AggregateIndex for more details. """ self._container_writer = None self._write_index = None self._indices = index_to_packs self._reload_func = reload_func self._flush_func = flush_func def add_raw_record(self, key, size, raw_data): """Add raw knit bytes to a storage area. The data is spooled to the container writer in one bytes-record per raw data item. :param key: key of the data segment :param size: length of the data segment :param raw_data: A bytestring containing the data. 
:return: An opaque index memo. For _DirectPackAccess the memo is (index, pos, length), where the index field is the write_index object supplied to the PackAccess object. """ p_offset, p_length = self._container_writer.add_bytes_record(raw_data, size, []) return (self._write_index, p_offset, p_length) def add_raw_records(self, key_sizes, raw_data): """Add raw knit bytes to a storage area. The data is spooled to the container writer in one bytes-record per raw data item. :param key_sizes: An iterable of tuples containing the key and size of each raw data segment. :param raw_data: An iterable of bytestrings which, joined together, contain the data. :return: A list of memos to retrieve the record later. Each memo is an opaque index memo. For _DirectPackAccess the memo is (index, pos, length), where the index field is the write_index object supplied to the PackAccess object. """ raw_data = b"".join(raw_data) if not isinstance(raw_data, bytes): raise AssertionError(f"data must be plain bytes was {type(raw_data)}") result = [] offset = 0 for key, size in key_sizes: result.append( self.add_raw_record(key, size, [raw_data[offset : offset + size]]) ) offset += size return result def flush(self): """Flush pending writes on this access object. This will flush any buffered writes to a NewPack. """ if self._flush_func is not None: self._flush_func() def get_raw_records(self, memos_for_retrieval): """Get the raw bytes for records. :param memos_for_retrieval: An iterable containing the (index, pos, length) memo for retrieving the bytes. The Pack access method looks up the pack to use for a given record in its index_to_packs map. :return: An iterator over the bytes of the records.
""" # first pass, group into same-index requests request_lists = [] current_index = None for index, offset, length in memos_for_retrieval: if current_index == index: current_list.append((offset, length)) else: if current_index is not None: request_lists.append((current_index, current_list)) current_index = index current_list = [(offset, length)] # handle the last entry if current_index is not None: request_lists.append((current_index, current_list)) for index, offsets in request_lists: try: transport, path = self._indices[index] except KeyError as e: # A KeyError here indicates that someone has triggered an index # reload, and this index has gone missing, we need to start # over. if self._reload_func is None: # If we don't have a _reload_func there is nothing that can # be done raise raise RetryWithNewPacks( index, reload_occurred=True, exc_info=sys.exc_info() ) from e try: reader = _mod_pack.make_readv_reader(transport, path, offsets) for _names, read_func in reader.iter_records(): yield read_func(None) except TransportNoSuchFile as e: # A NoSuchFile error indicates that a pack file has gone # missing on disk, we need to trigger a reload, and start over. if self._reload_func is None: raise raise RetryWithNewPacks( transport.abspath(path), reload_occurred=False, exc_info=sys.exc_info(), ) from e def set_writer(self, writer, index, transport_packname): """Set a writer to use for adding data.""" if index is not None: self._indices[index] = transport_packname self._container_writer = writer self._write_index = index def reload_or_raise(self, retry_exc): """Try calling the reload function, or re-raise the original exception. This should be called after _DirectPackAccess raises a RetryWithNewPacks exception. This function will handle the common logic of determining when the error is fatal versus being temporary. It will also make sure that the original exception is raised, rather than the RetryWithNewPacks exception. 
If this function returns, then the calling function should retry whatever operation was being performed. Otherwise an exception will be raised. :param retry_exc: A RetryWithNewPacks exception. """ is_error = False if self._reload_func is None: is_error = True elif not self._reload_func(): # The reload claimed that nothing changed if not retry_exc.reload_occurred: # If there wasn't an earlier reload, then we really were # expecting to find changes. We didn't find them, so this is a # hard error is_error = True if is_error: # GZ 2017-03-27: No real reason this needs the original traceback. raise retry_exc.exc_info[1] class Pack: """An in memory proxy for a pack and its indices. This is a base class that is not directly used, instead the classes ExistingPack and NewPack are used. """ # A map of index 'type' to the file extension and position in the # index_sizes array. index_definitions = { "chk": (".cix", 4), "revision": (".rix", 0), "inventory": (".iix", 1), "text": (".tix", 2), "signature": (".six", 3), } def __init__( self, revision_index, inventory_index, text_index, signature_index, chk_index=None, ): """Create a pack instance. :param revision_index: A GraphIndex for determining what revisions are present in the Pack and accessing the locations of their texts. :param inventory_index: A GraphIndex for determining what inventories are present in the Pack and accessing the locations of their texts/deltas. :param text_index: A GraphIndex for determining what file texts are present in the pack and accessing the locations of their texts/deltas (via (fileid, revisionid) tuples). :param signature_index: A GraphIndex for determining what signatures are present in the Pack and accessing the locations of their texts. :param chk_index: A GraphIndex for accessing content by CHK, if the pack has one. 
""" self.revision_index = revision_index self.inventory_index = inventory_index self.text_index = text_index self.signature_index = signature_index self.chk_index = chk_index def access_tuple(self): """Return a tuple (transport, name) for the pack content.""" return self.pack_transport, self.file_name() def _check_references(self): """Make sure our external references are present. Packs are allowed to have deltas whose base is not in the pack, but it must be present somewhere in this collection. It is not allowed to have deltas based on a fallback repository. (See ) """ missing_items = {} for index_name, external_refs, index in [ ( "texts", self._get_external_refs(self.text_index), self._pack_collection.text_index.combined_index, ), ( "inventories", self._get_external_refs(self.inventory_index), self._pack_collection.inventory_index.combined_index, ), ]: missing = external_refs.difference( k for (idx, k, v, r) in index.iter_entries(external_refs) ) if missing: missing_items[index_name] = sorted(missing) if missing_items: from pprint import pformat raise BzrCheckError( f"Newly created pack file {self!r} has delta references to " f"items not in its repository:\n{pformat(missing_items)}" ) def file_name(self): """Get the file name for the pack on disk.""" return self.name + ".pack" def get_revision_count(self): """Return the number of revisions in this pack.""" return self.revision_index.key_count() def index_name(self, index_type, name): """Get the disk name of an index type for pack name 'name'.""" return name + Pack.index_definitions[index_type][0] def index_offset(self, index_type): """Get the position in a index_size array for a given index type.""" return Pack.index_definitions[index_type][1] def inventory_index_name(self, name): """The inv index is the name + .iix.""" return self.index_name("inventory", name) def revision_index_name(self, name): """The revision index is the name + .rix.""" return self.index_name("revision", name) def signature_index_name(self, 
name): """The signature index is the name + .six.""" return self.index_name("signature", name) def text_index_name(self, name): """The text index is the name + .tix.""" return self.index_name("text", name) def _replace_index_with_readonly(self, index_type): unlimited_cache = False if index_type == "chk": unlimited_cache = True index = self.index_class( self.index_transport, self.index_name(index_type, self.name), self.index_sizes[self.index_offset(index_type)], unlimited_cache=unlimited_cache, ) if index_type == "chk": index._leaf_factory = btree_index._gcchk_factory setattr(self, index_type + "_index", index) def __lt__(self, other): """Compare packs by identity for ordering.""" if not isinstance(other, Pack): raise TypeError(other) return id(self) < id(other) def __hash__(self): """Return hash based on index objects.""" return hash( ( type(self), self.revision_index, self.inventory_index, self.text_index, self.signature_index, self.chk_index, ) ) class ExistingPack(Pack): """An in memory proxy for an existing .pack and its disk indices.""" def __init__( self, pack_transport, name, revision_index, inventory_index, text_index, signature_index, chk_index=None, ): """Create an ExistingPack object. :param pack_transport: The transport where the pack file resides. :param name: The name of the pack on disk in the pack_transport. 
""" Pack.__init__( self, revision_index, inventory_index, text_index, signature_index, chk_index, ) self.name = name self.pack_transport = pack_transport if None in ( revision_index, inventory_index, text_index, signature_index, name, pack_transport, ): raise AssertionError() def __eq__(self, other): """Check equality by comparing all attributes.""" return self.__dict__ == other.__dict__ def __ne__(self, other): """Check inequality.""" return not self.__eq__(other) def __repr__(self): """Return string representation.""" return "<{}.{} object at 0x{:x}, {}, {}".format( self.__class__.__module__, self.__class__.__name__, id(self), self.pack_transport, self.name, ) def __hash__(self): """Return hash based on type and name.""" return hash((type(self), self.name)) class ResumedPack(ExistingPack): """A pack being resumed from an interrupted upload.""" def __init__( self, name, revision_index, inventory_index, text_index, signature_index, upload_transport, pack_transport, index_transport, pack_collection, chk_index=None, ): """Create a ResumedPack object.""" ExistingPack.__init__( self, pack_transport, name, revision_index, inventory_index, text_index, signature_index, chk_index=chk_index, ) self.upload_transport = upload_transport self.index_transport = index_transport self.index_sizes = [None, None, None, None] indices = [ ("revision", revision_index), ("inventory", inventory_index), ("text", text_index), ("signature", signature_index), ] if chk_index is not None: indices.append(("chk", chk_index)) self.index_sizes.append(None) for index_type, index in indices: offset = self.index_offset(index_type) self.index_sizes[offset] = index._size self.index_class = pack_collection._index_class self._pack_collection = pack_collection self._state = "resumed" # XXX: perhaps check that the .pack file exists? 
def access_tuple(self): """Return the transport and file name for accessing the pack data.""" if self._state == "finished": return Pack.access_tuple(self) elif self._state == "resumed": return self.upload_transport, self.file_name() else: raise AssertionError(self._state) def abort(self): """Abort the resumed pack, deleting its files.""" self.upload_transport.delete(self.file_name()) indices = [ self.revision_index, self.inventory_index, self.text_index, self.signature_index, ] if self.chk_index is not None: indices.append(self.chk_index) for index in indices: index._transport.delete(index._name) def finish(self): """Finish the resumed pack, moving files into place.""" self._check_references() index_types = ["revision", "inventory", "text", "signature"] if self.chk_index is not None: index_types.append("chk") for index_type in index_types: old_name = self.index_name(index_type, self.name) new_name = "../indices/" + old_name self.upload_transport.move(old_name, new_name) self._replace_index_with_readonly(index_type) new_name = "../packs/" + self.file_name() self.upload_transport.move(self.file_name(), new_name) self._state = "finished" def _get_external_refs(self, index): """Return compression parents for this index that are not present. This returns any compression parents that are referenced by this index, which are not contained *in* this index. They may be present elsewhere. """ return index.external_references(1) class NewPack(Pack): """An in memory proxy for a pack which is being created.""" def __init__(self, pack_collection, upload_suffix="", file_mode=None): """Create a NewPack instance. :param pack_collection: A PackCollection into which this is being inserted. :param upload_suffix: An optional suffix to be given to any temporary files created during the pack creation. e.g '.autopack' :param file_mode: Unix permissions for newly created file. 
""" # The relative locations of the packs are constrained, but all are # passed in because the caller has them, so as to avoid object churn. index_builder_class = pack_collection._index_builder_class if pack_collection.chk_index is not None: chk_index = index_builder_class(reference_lists=0) else: chk_index = None Pack.__init__( self, # Revisions: parents list, no text compression. index_builder_class(reference_lists=1), # Inventory: We want to map compression only, but currently the # knit code hasn't been updated enough to understand that, so we # have a regular 2-list index giving parents and compression # source. index_builder_class(reference_lists=2), # Texts: compression and per file graph, for all fileids - so two # reference lists and two elements in the key tuple. index_builder_class(reference_lists=2, key_elements=2), # Signatures: Just blobs to store, no compression, no parents # listing. index_builder_class(reference_lists=0), # CHK based storage - just blobs, no compression or parents. chk_index=chk_index, ) self._pack_collection = pack_collection # When we make readonly indices, we need this. self.index_class = pack_collection._index_class # where should the new pack be opened self.upload_transport = pack_collection._upload_transport # where are indices written out to self.index_transport = pack_collection._index_transport # where is the pack renamed to when it is finished? self.pack_transport = pack_collection._pack_transport # What file mode to upload the pack and indices with. self._file_mode = file_mode # tracks the content written to the .pack file. self._hash = hashlib.md5() # noqa: S324 # a tuple with the length in bytes of the indices, once the pack # is finalised. (rev, inv, text, sigs, chk_if_in_use) self.index_sizes = None # How much data to cache when writing packs. 
Note that this is not # synchronised with reads, because it's not in the transport layer, so # is not safe unless the client knows it won't be reading from the pack # under creation. self._cache_limit = 0 # the temporary pack file name. from .osutils import rand_chars self.random_name = rand_chars(20) + upload_suffix # when was this pack started ? self.start_time = time.time() # open an output stream for the data added to the pack. self.write_stream = self.upload_transport.open_write_stream( self.random_name, mode=self._file_mode ) logger.debug( "%s: create_pack: pack stream open: %s%s t+%6.3fs", time.ctime(), self.upload_transport.base, self.random_name, time.time() - self.start_time, ) # A list of byte sequences to be written to the new pack, and the # aggregate size of them. Stored as a list rather than separate # variables so that the _write_data closure below can update them. self._buffer = [[], 0] # create a callable for adding data # # robertc says- this is a closure rather than a method on the object # so that the variables are locals, and faster than accessing object # members. def _write_data( bytes, flush=False, _buffer=self._buffer, _write=self.write_stream.write, _update=self._hash.update, ): _buffer[0].append(bytes) _buffer[1] += len(bytes) # buffer cap if _buffer[1] > self._cache_limit or flush: bytes = b"".join(_buffer[0]) _write(bytes) _update(bytes) _buffer[:] = [[], 0] # expose this on self, for the occasion when clients want to add data. self._write_data = _write_data # a pack writer object to serialise pack records. self._writer = _mod_pack.ContainerWriter(self._write_data) self._writer.begin() # what state is the pack in? (open, finished, aborted) self._state = "open" # no name until we finish writing the content self.name = None def abort(self): """Cancel creating this pack.""" self._state = "aborted" self.write_stream.close() # Remove the temporary pack file. self.upload_transport.delete(self.random_name) # The indices have no state on disk. 
def access_tuple(self): """Return a tuple (transport, name) for the pack content.""" if self._state == "finished": return Pack.access_tuple(self) elif self._state == "open": return self.upload_transport, self.random_name else: raise AssertionError(self._state) def data_inserted(self): """True if data has been added to this pack.""" return bool( self.get_revision_count() or self.inventory_index.key_count() or self.text_index.key_count() or self.signature_index.key_count() or (self.chk_index is not None and self.chk_index.key_count()) ) def finish_content(self): """Finalize the pack content and compute the content hash name.""" if self.name is not None: return self._writer.end() if self._buffer[1]: self._write_data(b"", flush=True) self.name = self._hash.hexdigest() def finish(self, suspend=False): """Finish the new pack. This: - finalises the content - assigns a name (the md5 of the content, currently) - writes out the associated indices - renames the pack into place. - stores the index size tuple for the pack in the index_sizes attribute. """ self.finish_content() if not suspend: self._check_references() # write indices # XXX: It'd be better to write them all to temporary names, then # rename them all into place, so that the window when only some are # visible is smaller. On the other hand none will be seen until # they're in the names list. 
self.index_sizes = [None, None, None, None] self._write_index("revision", self.revision_index, "revision", suspend) self._write_index("inventory", self.inventory_index, "inventory", suspend) self._write_index("text", self.text_index, "file texts", suspend) self._write_index( "signature", self.signature_index, "revision signatures", suspend ) if self.chk_index is not None: self.index_sizes.append(None) self._write_index("chk", self.chk_index, "content hash bytes", suspend) self.write_stream.close( want_fdatasync=self._pack_collection.config_stack.get( "repository.fdatasync" ) ) # Note that this will clobber an existing pack with the same name, # without checking for hash collisions. While this is undesirable this # is something that can be rectified in a subsequent release. One way # to rectify it may be to leave the pack at the original name, writing # its pack-names entry as something like 'HASH: index-sizes # temporary-name'. Allocate that and check for collisions, if it is # collision free then rename it into place. If clients know this scheme # they can handle missing-file errors by: # - try for HASH.pack # - try for temporary-name # - refresh the pack-list to see if the pack is now absent new_name = self.name + ".pack" if not suspend: new_name = "../packs/" + new_name self.upload_transport.move(self.random_name, new_name) self._state = "finished" # XXX: size might be interesting? 
logger.debug( "%s: create_pack: pack finished: %s%s->%s t+%6.3fs", time.ctime(), self.upload_transport.base, self.random_name, new_name, time.time() - self.start_time, ) def flush(self): """Flush any current data.""" if self._buffer[1]: bytes = b"".join(self._buffer[0]) self.write_stream.write(bytes) self._hash.update(bytes) self._buffer[:] = [[], 0] def _get_external_refs(self, index): return index._external_references() def set_write_cache_size(self, size): """Set the write cache size in bytes.""" self._cache_limit = size def _write_index(self, index_type, index, label, suspend=False): """Write out an index. :param index_type: The type of index to write - e.g. 'revision'. :param index: The index object to serialise. :param label: What label to give the index e.g. 'revision'. """ index_name = self.index_name(index_type, self.name) transport = self.upload_transport if suspend else self.index_transport index_tempfile = index.finish() index_bytes = index_tempfile.read() write_stream = transport.open_write_stream(index_name, mode=self._file_mode) write_stream.write(index_bytes) write_stream.close( want_fdatasync=self._pack_collection.config_stack.get( "repository.fdatasync" ) ) self.index_sizes[self.index_offset(index_type)] = len(index_bytes) # XXX: size might be interesting? logger.debug( "%s: create_pack: wrote %s index: %s%s t+%6.3fs", time.ctime(), label, self.upload_transport.base, self.random_name, time.time() - self.start_time, ) # Replace the writable index on this object with a readonly, # presently unloaded index. We should alter # the index layer to make its finish() error if add_node is # subsequently used. 
RBC self._replace_index_with_readonly(index_type) bzrformats_3.4.0.orig/bzrformats/progress.py0000644000000000000000000000273215162115103016264 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Minimal progress bar protocol for bzrformats.""" from typing import Protocol, runtime_checkable @runtime_checkable class ProgressBar(Protocol): """Protocol for progress reporting.""" def update( self, msg: str | None = None, current: int | None = None, total: int | None = None, ) -> None: """Report progress. :param msg: Description of the current step. :param current: Current step number. :param total: Total number of steps. """ ... def tick(self) -> None: """Indicate that some work was done without specific progress info.""" ... def finished(self) -> None: """Mark the progress bar as complete.""" ... bzrformats_3.4.0.orig/bzrformats/python-compat.h0000644000000000000000000000445215162073433017033 0ustar00/* * Bazaar -- distributed version control * * Copyright (C) 2008 by Canonical Ltd * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ /* Provide the typedefs that pyrex does automatically in newer versions, to * allow older versions to build our extensions. */ #ifndef _BZR_PYTHON_COMPAT_H #define _BZR_PYTHON_COMPAT_H #ifdef _MSC_VER #define inline __inline #endif #if defined(_WIN32) || defined(WIN32) /* Defining WIN32_LEAN_AND_MEAN makes including windows quite a bit * lighter weight. */ #define WIN32_LEAN_AND_MEAN #include <windows.h> /* Needed for htonl */ #include "Winsock2.h" /* sys/stat.h doesn't have any of these macro definitions for MSVC, so * we'll define whatever is missing that we actually use.
*/ #if !defined(S_ISDIR) #define S_ISDIR(m) (((m) & 0170000) == 0040000) #endif #if !defined(S_ISREG) #define S_ISREG(m) (((m) & 0170000) == 0100000) #endif #if !defined(S_IXUSR) #define S_IXUSR 0000100/* execute/search permission, owner */ #endif /* sys/stat.h doesn't have S_ISLNK on win32, so we fake it by just always * returning False */ #if !defined(S_ISLNK) #define S_ISLNK(mode) (0) #endif #else /* Not win32 */ /* For htonl */ #include "arpa/inet.h" #endif #include <stdio.h> #ifdef _MSC_VER #define snprintf _snprintf /* gcc (mingw32) has strtoll, while the MSVC compiler uses _strtoi64 */ #define strtoll _strtoi64 #define strtoull _strtoui64 #endif #if PY_VERSION_HEX < 0x030900A4 # define Py_SET_REFCNT(obj, refcnt) ((Py_REFCNT(obj) = (refcnt)), (void)0) #endif #endif /* _BZR_PYTHON_COMPAT_H */ bzrformats_3.4.0.orig/bzrformats/recordcounter.py0000644000000000000000000000745615162073400017311 0ustar00# Copyright (C) 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Record counting support for showing progress of revision fetch.""" class RecordCounter: """Container that maintains estimates of the work required for a fetch. An instance of this class is used along with a progress bar to provide the user an estimate of the amount of work pending for a fetch (push, pull, branch, checkout) operation.
""" def __init__(self): """Initialize a new RecordCounter instance.""" self.initialized = False self.current = 0 self.key_count = 0 self.max = 0 # Users of RecordCounter instance update progress bar every # _STEP_ records. We choose are reasonably high number to keep # display updates from being too frequent. This is an odd number # to ensure that the last digit of the records fetched in # fetches vs estimate ratio changes periodically. self.STEP = 7 def is_initialized(self): """Check if the counter has been initialized. Returns: bool: True if setup() has been called, False otherwise. """ return self.initialized def _estimate_max(self, key_count): """Estimate the maximum amount of 'inserting stream' work. This is just an estimate. """ # Note: The magic number below is based of empirical data # based on 3 seperate projects. Estimatation can probably # be improved but this should work well for most cases. # The project used for the estimate (with approx. numbers) were: # lp:bzr with records_fetched = 7 * revs_required # lp:emacs with records_fetched = 8 * revs_required # bzr-svn checkout of lp:parrot = 10.63 * revs_required # Hence, 10.3 was chosen as for a realistic progress bar as: # 1. If records fetched is is lower than 10.3x then we simply complete # with 10.3x. Under promise, over deliver. # 2. In case of remote fetch, when we start the count fetch vs estimate # display with revs_required/estimate, having a multiplier with a # decimal point produces a realistic looking _estimate_ number rather # than using something like 3125/31250 (for 10x) # 3. Based on the above data, the possibility of overshooting this # factor is minimal, and in case of an overshoot the estimate value # should not need to be corrected too many times. return int(key_count * 10.3) def setup(self, key_count, current=0): """Setup RecordCounter with basic estimate of work pending. Setup self.max and self.current to reflect the amount of work pending for a fetch. 
""" self.current = current self.key_count = key_count self.max = self._estimate_max(key_count) self.initialized = True def increment(self, count): """Increment self.current by count. Apart from incrementing self.current by count, also ensure that self.max > self.current. """ self.current += count if self.current > self.max: self.max += self.key_count bzrformats_3.4.0.orig/bzrformats/registry.py0000644000000000000000000000645315162115103016274 0ustar00"""Registry imports from catalogus package.""" from catalogus.registry import Registry, _ObjectGetter __all__ = ["FormatRegistry", "Registry", "_ObjectGetter"] # FormatRegistry is not available in catalogus, so we define it here class FormatRegistry(Registry): """Registry specialised for handling formats.""" def __init__(self, other_registry=None): """Initialize FormatRegistry. Args: other_registry: Optional additional registry to mirror registrations to. """ super().__init__() self._other_registry = other_registry def register(self, key, obj, help=None, info=None, override_existing=False): """Register a format object. Args: key: The format name key. obj: The format object or factory function. help: Optional help text for this format. info: Optional additional information about the format. override_existing: Whether to allow overriding existing registrations. Returns: None """ Registry.register( self, key, obj, help=help, info=info, override_existing=override_existing ) if self._other_registry is not None: self._other_registry.register( key, obj, help=help, info=info, override_existing=override_existing ) def register_lazy( self, key, module_name, member_name, help=None, info=None, override_existing=False, ): """Register a format that will be imported on first access. Args: key: The format name key. module_name: Name of the module containing the format. member_name: Name of the format object within the module. help: Optional help text for this format. info: Optional additional information about the format. 
override_existing: Whether to allow overriding existing registrations. Returns: None """ # Overridden to allow capturing registrations to two separate # registries in a single call. Registry.register_lazy( self, key, module_name, member_name, help=help, info=info, override_existing=override_existing, ) if self._other_registry is not None: self._other_registry.register_lazy( key, module_name, member_name, help=help, info=info, override_existing=override_existing, ) def remove(self, key): """Remove a format from the registry. Args: key: The format name key to remove. Returns: None """ super().remove(key) if self._other_registry is not None: self._other_registry.remove(key) def get(self, format_string): """Get a format object, calling factory functions if needed. Args: format_string: The format name to retrieve. Returns: The format object, with factory functions automatically called. """ r = Registry.get(self, format_string) if callable(r): r = r() return r bzrformats_3.4.0.orig/bzrformats/revision.py0000644000000000000000000000236015162115103016253 0ustar00# Copyright (C) 2005-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details.
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Bazaar-specific revision implementation.""" from bzrformats._bzr_rs import ( CURRENT_REVISION, NULL_REVISION, check_not_reserved_id, is_null, is_reserved_id, ) from bzrformats._bzr_rs import ( Revision as BzrRevision, ) RevisionID = bytes # Re-export the Bazaar revision implementation Revision = BzrRevision __all__ = [ "CURRENT_REVISION", "NULL_REVISION", "BzrRevision", "Revision", "RevisionID", "check_not_reserved_id", "is_null", "is_reserved_id", ] bzrformats_3.4.0.orig/bzrformats/rio_patch.py0000644000000000000000000001175315162073400016376 0ustar00# Copyright (C) 2005 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """RIO-Patch format handling for email-safe stanza representation. This module provides functions to convert between RIO stanzas and RIO-Patch format, which is designed to be emailed as part of a patch. The format resists common forms of damage such as newline conversion or removal of trailing whitespace. The RIO (restricted/reproducible/rfc822-like) format stores data as a series of stanzas containing fields identified by ASCII names with Unicode or string contents. 
""" # \subsection{\emph{rio} - simple text metaformat} # # \emph{r} stands for `restricted', `reproducible', or `rfc822-like'. # # The stored data consists of a series of \emph{stanzas}, each of which contains # \emph{fields} identified by an ascii name, with Unicode or string contents. # The field tag is constrained to alphanumeric characters. # There may be more than one field in a stanza with the same name. # # The format itself does not deal with character encoding issues, though # the result will normally be written in Unicode. # # The format is intended to be simple enough that there is exactly one character # stream representation of an object and vice versa, and that this relation # will continue to hold for future versions of bzr. import re from . import rio def to_patch_lines(stanza, max_width=72): """Convert a stanza into RIO-Patch format lines. RIO-Patch is a RIO variant designed to be e-mailed as part of a patch. It resists common forms of damage such as newline conversion or the removal of trailing whitespace, yet is also reasonably easy to read. :param max_width: The maximum number of characters per physical line. 
:return: a list of lines """ if max_width <= 6: raise ValueError(max_width) max_rio_width = max_width - 4 lines = [] for pline in stanza.to_lines(): for line in pline.split(b"\n")[:-1]: line = re.sub(b"\\\\", b"\\\\\\\\", line) while len(line) > 0: partline = line[:max_rio_width] line = line[max_rio_width:] if len(line) > 0 and line[:1] != [b" "]: break_index = -1 break_index = partline.rfind(b" ", -20) if break_index < 3: break_index = partline.rfind(b"-", -20) break_index += 1 if break_index < 3: break_index = partline.rfind(b"/", -20) if break_index >= 3: line = partline[break_index:] + line partline = partline[:break_index] if len(line) > 0: line = b" " + line partline = re.sub(b"\r", b"\\\\r", partline) blank_line = False if len(line) > 0: partline += b"\\" elif re.search(b" $", partline): partline += b"\\" blank_line = True lines.append(b"# " + partline + b"\n") if blank_line: lines.append(b"# \n") return lines def _patch_stanza_iter(line_iter): map = {b"\\\\": b"\\", b"\\r": b"\r", b"\\\n": b""} def mapget(match): return map[match.group(0)] last_line = None for line in line_iter: if line.startswith(b"# "): line = line[2:] elif line.startswith(b"#"): line = line[1:] else: raise ValueError(f"bad line {line!r}") if last_line is not None and len(line) > 2: line = line[2:] line = re.sub(b"\r", b"", line) line = re.sub(b"\\\\(.|\n)", mapget, line) if last_line is None: last_line = line else: last_line += line if last_line[-1:] == b"\n": yield last_line last_line = None if last_line is not None: yield last_line def read_patch_stanza(line_iter): """Convert an iterable of RIO-Patch format lines into a Stanza. RIO-Patch is a RIO variant designed to be e-mailed as part of a patch. It resists common forms of damage such as newline conversion or the removal of trailing whitespace, yet is also reasonably easy to read. 
:return: a Stanza """ return rio.read_stanza(_patch_stanza_iter(line_iter)) bzrformats_3.4.0.orig/bzrformats/serializer.py0000644000000000000000000001656415162115103016601 0ustar00# Copyright (C) 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Inventory/revision serialization.""" from . import registry from .errors import BzrFormatsError class BadInventoryFormat(BzrFormatsError): """Base exception class for inventory serialization errors.""" _fmt = "Root class for inventory serialization errors" class UnexpectedInventoryFormat(BadInventoryFormat): """Raised when an inventory is not in the expected format.""" _fmt = "The inventory was not in the expected format:\n %(msg)s" def __init__(self, msg): """Initialize UnexpectedInventoryFormat exception. Args: msg: Error message describing the unexpected format. """ super().__init__(msg=msg) class UnsupportedInventoryKind(BzrFormatsError): """Raised when an unsupported inventory entry kind is encountered.""" _fmt = """Unsupported entry kind %(kind)s""" def __init__(self, kind): """Initialize UnsupportedInventoryKind exception. Args: kind: The unsupported entry kind. 
""" super().__init__() self.kind = kind class RevisionSerializer: """Revision serialization/deserialization.""" squashes_xml_invalid_characters = False def write_revision_to_string(self, rev): """Serialize a revision to a string. Args: rev: The revision object to serialize. Returns: Serialized revision as a string. Raises: NotImplementedError: This method must be implemented by subclasses. """ raise NotImplementedError(self.write_revision_to_string) def write_revision_to_lines(self, rev): """Serialize a revision to a list of lines. Args: rev: The revision object to serialize. Returns: Serialized revision as a list of lines. Raises: NotImplementedError: This method must be implemented by subclasses. """ raise NotImplementedError(self.write_revision_to_lines) def read_revision(self, f): """Read a revision from a file object. Args: f: File-like object to read from. Returns: Deserialized revision object. Raises: NotImplementedError: This method must be implemented by subclasses. """ raise NotImplementedError(self.read_revision) def read_revision_from_string(self, xml_string): """Read a revision from a string. Args: xml_string: String containing the serialized revision. Returns: Deserialized revision object. Raises: NotImplementedError: This method must be implemented by subclasses. """ raise NotImplementedError(self.read_revision_from_string) class InventorySerializer: """Inventory serialization/deserialization.""" def write_inventory(self, inv, f): """Write inventory to a file. Note: this is a *whole inventory* operation, and should only be used sparingly, as it does not scale well with large trees. """ raise NotImplementedError(self.write_inventory) def write_inventory_to_chunks(self, inv): """Produce a simple bytestring chunk representation of an inventory. Note: this is a *whole inventory* operation, and should only be used sparingly, as it does not scale well with large trees. 
The requirement for the contents of the string is that it can be passed to read_inventory_from_lines and the result is an identical inventory in memory. """ raise NotImplementedError(self.write_inventory_to_chunks) def write_inventory_to_lines(self, inv): """Produce a simple lines representation of an inventory. Note: this is a *whole inventory* operation, and should only be used sparingly, as it does not scale well with large trees. The requirement for the contents of the string is that it can be passed to read_inventory_from_lines and the result is an identical inventory in memory. """ raise NotImplementedError(self.write_inventory_to_lines) def read_inventory_from_lines( self, lines, revision_id=None, entry_cache=None, return_from_cache=False ): """Read bytestring chunks into an inventory object. :param lines: The serialized inventory to read. :param revision_id: If not-None, the expected revision id of the inventory. Some serialisers use this to set the results' root revision. This should be supplied for deserialising all from-repository inventories so that xml5 inventories that were serialised without a revision identifier can be given the right revision id (but not for working tree inventories where users can edit the data without triggering checksum errors or anything). :param entry_cache: An optional cache of InventoryEntry objects. If supplied we will look up entries via (file_id, revision_id) which should map to a valid InventoryEntry (File/Directory/etc) object. :param return_from_cache: Return entries directly from the cache, rather than copying them first. This is only safe if the caller promises not to mutate the returned inventory entries, but it can make some operations significantly faster. 
""" raise NotImplementedError(self.read_inventory_from_lines) def read_inventory(self, f, revision_id=None): """See read_inventory_from_lines.""" raise NotImplementedError(self.read_inventory) class SerializerRegistry(registry.Registry): """Registry for serializer objects.""" revision_format_registry = SerializerRegistry() revision_format_registry.register_lazy( "5", "bzrformats._bzr_rs", "revision_serializer_v5" ) revision_format_registry.register_lazy( "8", "bzrformats._bzr_rs", "revision_serializer_v8" ) revision_format_registry.register_lazy( "10", "bzrformats._bzr_rs", "revision_bencode_serializer" ) inventory_format_registry = SerializerRegistry() inventory_format_registry.register_lazy( "5", "bzrformats.xml5", "inventory_serializer_v5" ) inventory_format_registry.register_lazy( "6", "bzrformats.xml6", "inventory_serializer_v6" ) inventory_format_registry.register_lazy( "7", "bzrformats.xml7", "inventory_serializer_v7" ) inventory_format_registry.register_lazy( "8", "bzrformats.xml8", "inventory_serializer_v8" ) inventory_format_registry.register_lazy( "9", "bzrformats.chk_serializer", "inventory_chk_serializer_255_bigpage_9" ) inventory_format_registry.register_lazy( "10", "bzrformats.chk_serializer", "inventory_chk_serializer_255_bigpage_10" ) bzrformats_3.4.0.orig/bzrformats/tests/0000755000000000000000000000000015162073400015207 5ustar00bzrformats_3.4.0.orig/bzrformats/textmerge.py0000644000000000000000000001460015162073433016432 0ustar00# Copyright (C) 2006, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # Author: Martin Pool # Aaron Bentley """Text merge functionality for handling two-way and three-way merges. This module provides classes for merging text files with conflict detection and resolution. It supports structured merge information representation and various merge strategies. """ class TextMerge: """Base class for text-mergers. Subclasses must implement _merge_struct. Many methods produce or consume structured merge information. This is an iterable of tuples of lists of lines. Each tuple may have a length of 1 - 3, depending on whether the region it represents is conflicted. Unconflicted region tuples have length 1. Conflicted region tuples have length 2 or 3. Index 0 is text_a, e.g. THIS. Index 1 is text_b, e.g. OTHER. Index 2 is optional. If present, it represents BASE. """ # TODO: Show some version information (e.g. author, date) on conflicted # regions. A_MARKER = b"<<<<<<< \n" B_MARKER = b">>>>>>> \n" SPLIT_MARKER = b"=======\n" def __init__(self, a_marker=A_MARKER, b_marker=B_MARKER, split_marker=SPLIT_MARKER): r"""Initialize a TextMerge instance with conflict markers. Args: a_marker: Marker for the start of conflicted region A (THIS). Defaults to "<<<<<<< \n". b_marker: Marker for the end of conflicted region B (OTHER). Defaults to ">>>>>>> \n". split_marker: Marker separating conflicted regions A and B. Defaults to "=======\n". """ self.a_marker = a_marker self.b_marker = b_marker self.split_marker = split_marker def _merge_struct(self): """Return structured merge info. Must be implemented by subclasses. See TextMerge docstring for details on the format.
""" raise NotImplementedError("_merge_struct is abstract") def struct_to_lines(self, struct_iter): """Convert merge result tuples to lines.""" for lines in struct_iter: if len(lines) == 1: yield from lines[0] else: yield self.a_marker yield from lines[0] yield self.split_marker yield from lines[1] yield self.b_marker def iter_useful(self, struct_iter): """Iterate through input tuples, skipping empty ones.""" for group in struct_iter: if len(group[0]) > 0: yield group elif len(group) > 1 and len(group[1]) > 0: yield group def merge_lines(self, reprocess=False): """Produce an iterable of lines, suitable for writing to a file Returns a tuple of (line iterable, conflict indicator) If reprocess is True, a two-way merge will be performed on the intermediate structure, to reduce conflict regions. """ struct = [] conflicts = False for group in self.merge_struct(reprocess): struct.append(group) if len(group) > 1: conflicts = True return self.struct_to_lines(struct), conflicts def merge_struct(self, reprocess=False): """Produce structured merge info.""" struct_iter = self.iter_useful(self._merge_struct()) if reprocess is True: return self.reprocess_struct(struct_iter) else: return struct_iter @staticmethod def reprocess_struct(struct_iter): """Perform a two-way merge on structural merge info. This reduces the size of conflict regions, but breaks the connection between the BASE text and the conflict region. This process may split a single conflict region into several smaller ones, but will not introduce new conflicts. """ for group in struct_iter: if len(group) == 1: yield group else: yield from Merge2(group[0], group[1]).merge_struct() class Merge2(TextMerge): """Two-way merge. In a two way merge, common regions are shown as unconflicting, and uncommon regions produce conflicts. """ def __init__( self, lines_a, lines_b, a_marker=TextMerge.A_MARKER, b_marker=TextMerge.B_MARKER, split_marker=TextMerge.SPLIT_MARKER, ): """Initialize a two-way merge operation. 
Args: lines_a: Sequence of lines from the first text (THIS). lines_b: Sequence of lines from the second text (OTHER). a_marker: Marker for the start of conflicted region A. Defaults to TextMerge.A_MARKER. b_marker: Marker for the end of conflicted region B. Defaults to TextMerge.B_MARKER. split_marker: Marker separating conflicted regions A and B. Defaults to TextMerge.SPLIT_MARKER. """ TextMerge.__init__(self, a_marker, b_marker, split_marker) self.lines_a = lines_a self.lines_b = lines_b def _merge_struct(self): """Return structured merge info. See TextMerge docstring. """ import patiencediff sm = patiencediff.PatienceSequenceMatcher(None, self.lines_a, self.lines_b) pos_a = 0 pos_b = 0 for ai, bi, l in sm.get_matching_blocks(): # non-matching lines yield (self.lines_a[pos_a:ai], self.lines_b[pos_b:bi]) # matching lines yield (self.lines_a[ai : ai + l],) pos_a = ai + l pos_b = bi + l # final non-matching lines yield (self.lines_a[pos_a:-1], self.lines_b[pos_b:-1]) bzrformats_3.4.0.orig/bzrformats/transport.py0000644000000000000000000003426015162115107016461 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Minimal transport for bzrformats. Provides a Transport protocol and a simple in-memory implementation. 
""" import posixpath from io import BytesIO from typing import Protocol, runtime_checkable from urllib.parse import unquote from .errors import PathError class NoSuchFile(PathError): """A file or directory does not exist.""" _fmt = "No such file: %(path)r%(extra)s" # Tuple for catching NoSuchFile from both bzrformats and breezy transports. # Use this in except clauses when the transport may be either implementation. try: from breezy.transport import NoSuchFile as _BreezyNoSuchFile TransportNoSuchFile = (NoSuchFile, _BreezyNoSuchFile) except ImportError: TransportNoSuchFile = NoSuchFile class FileExists(PathError): """A file or directory already exists.""" _fmt = "File exists: %(path)r%(extra)s" @runtime_checkable class Transport(Protocol): """Minimal transport protocol for bzrformats.""" base: str def get(self, relpath: str): """Get a file-like object for reading.""" ... def get_bytes(self, relpath: str) -> bytes: """Get the raw bytes of a file.""" ... def put_bytes(self, relpath: str, raw_bytes: bytes, mode=None): """Atomically put bytes at a location.""" ... def put_file(self, relpath: str, f, mode=None) -> int: """Write a file from a file-like object, returning bytes written.""" ... def put_file_non_atomic(self, relpath: str, f, mode=None, create_parent_dir=False): """Put a file-like object at a location.""" ... def append_bytes(self, relpath: str, raw_bytes: bytes, mode=None) -> int: """Append bytes to a file, returning the byte offset of the start.""" ... def readv(self, relpath: str, offsets): """Get parts of a file. :param offsets: List of (offset, size) tuples. :yields: (offset, data) tuples. """ ... def open_write_stream(self, relpath: str, mode=None): """Open a writable stream at relpath.""" ... def mkdir(self, relpath: str, mode=None): """Create a directory.""" ... def delete(self, relpath: str): """Delete a file.""" ... def move(self, rel_from: str, rel_to: str): """Move (rename) a file.""" ... 
def stat(self, relpath: str): """Return a stat-like object for a file.""" ... def has(self, relpath: str) -> bool: """Return True if the path exists.""" ... def abspath(self, relpath: str) -> str: """Return the full URL for the given relative path.""" ... def clone(self, relpath: str | None = None): """Return a new transport pointing at a sub-directory.""" ... def iter_files_recursive(self): """Iterate over all files below this transport, yielding relpaths.""" ... def ensure_base(self): """Ensure the base directory exists.""" ... def recommended_page_size(self) -> int: """Return the recommended number of bytes to read at once.""" ... class _MemoryStat: """Minimal stat result for MemoryTransport.""" def __init__(self, size, is_dir=False): self.st_size = size if is_dir: self.st_mode = 0o40755 else: self.st_mode = 0o100644 class _MemoryWriteStream: """A write stream that writes directly to the backing store. Data is visible to readers immediately after each ``write()``. """ def __init__(self, files, path): self._files = files self._path = path self._files.setdefault(path, b"") def write(self, data): self._files[self._path] = self._files.get(self._path, b"") + data def close(self): pass def __enter__(self): return self def __exit__(self, *args): self.close() def _sort_expand_and_combine(offsets, upper_limit, page_size): """Sort, expand, and combine readv offsets to reduce round trips. Each range is expanded to at least *page_size* bytes (centered on the original range), then overlapping ranges are merged. 
""" if not offsets: return [] sorted_offsets = sorted(offsets) expanded = [] for offset, length in sorted_offsets: expansion = max(0, page_size - length) reduction = expansion // 2 new_offset = max(0, offset - reduction) new_length = length + expansion if upper_limit: new_end = min(upper_limit, new_offset + new_length) new_length = max(0, new_end - new_offset) if new_length > 0: expanded.append((new_offset, new_length)) if not expanded: return [] merged = [expanded[0]] for offset, length in expanded[1:]: prev_offset, prev_length = merged[-1] prev_end = prev_offset + prev_length end = offset + length if offset > prev_end: merged.append((offset, length)) elif end > prev_end: merged[-1] = (prev_offset, end - prev_offset) return merged class MemoryTransport: """Simple in-memory transport for testing. All MemoryTransport instances sharing the same ``_files`` and ``_dirs`` dicts see the same data, so :meth:`clone` produces a view onto the same store. """ def __init__(self, url="memory:///", _files=None, _dirs=None): """Initialize MemoryTransport, optionally sharing an existing store.""" if not url.endswith("/"): url += "/" self.base = url self._files = _files if _files is not None else {} self._dirs = _dirs if _dirs is not None else set() self._dirs.add("/") # -- internal helpers -- def _abspath(self, relpath): """Resolve *relpath* to an absolute path within the store.""" if relpath is None or relpath == ".": relpath = "" relpath = unquote(relpath) path = posixpath.join(self._path(), relpath) return posixpath.normpath(path) def _path(self): """Extract the path portion from the base URL.""" path = self.base.split("://", 1)[-1] if path.endswith("/"): path = path[:-1] return path or "/" # -- Transport interface -- def clone(self, relpath=None): """Return a new transport rooted at *relpath*.""" if relpath is None: return MemoryTransport(self.base, self._files, self._dirs) return MemoryTransport(self.abspath(relpath), self._files, self._dirs) def abspath(self, relpath): 
"""Return the full ``memory://`` URL for *relpath*.""" return "memory://" + self._abspath(relpath) def has(self, relpath): """Return True if *relpath* exists as a file or directory.""" path = self._abspath(relpath) return path in self._files or path in self._dirs def get(self, relpath): """Return a :class:`BytesIO` with the contents of *relpath*.""" path = self._abspath(relpath) try: return BytesIO(self._files[path]) except KeyError: raise NoSuchFile(relpath) from None def get_bytes(self, relpath): """Return the raw bytes of *relpath*.""" path = self._abspath(relpath) try: return self._files[path] except KeyError: raise NoSuchFile(relpath) from None def put_bytes(self, relpath, raw_bytes, mode=None): """Store *raw_bytes* at *relpath*.""" self._files[self._abspath(relpath)] = raw_bytes def put_file(self, relpath, f, mode=None): """Write *f* to *relpath*, returning the number of bytes written.""" data = f.read() self._files[self._abspath(relpath)] = data return len(data) def put_file_non_atomic(self, relpath, f, mode=None, create_parent_dir=False): """Write *f* to *relpath*, creating parent dirs if requested.""" if create_parent_dir: self._ensure_parent(relpath) self._files[self._abspath(relpath)] = f.read() def append_bytes(self, relpath, raw_bytes, mode=None): """Append *raw_bytes* to *relpath*, returning the start offset.""" path = self._abspath(relpath) existing = self._files.get(path, b"") pos = len(existing) self._files[path] = existing + raw_bytes return pos def readv(self, relpath, offsets, adjust_for_latency=False, upper_limit=0): """Yield ``(offset, data)`` for each ``(offset, length)`` in *offsets*.""" file_data = self.get_bytes(relpath) offsets = list(offsets) if adjust_for_latency and offsets: offsets = _sort_expand_and_combine( offsets, upper_limit or len(file_data), self.recommended_page_size() ) for offset, length in offsets: yield offset, file_data[offset : offset + length] def open_write_stream(self, relpath, mode=None): """Return a writable stream; 
data is visible to readers after each write.""" return _MemoryWriteStream(self._files, self._abspath(relpath)) def mkdir(self, relpath, mode=None): """Create a directory at *relpath*. Does not raise if the directory already exists. """ self._dirs.add(self._abspath(relpath)) def delete(self, relpath): """Delete the file at *relpath*.""" path = self._abspath(relpath) try: del self._files[path] except KeyError: raise NoSuchFile(relpath) from None def move(self, rel_from, rel_to): """Move (rename) a file from *rel_from* to *rel_to*.""" path_from = self._abspath(rel_from) path_to = self._abspath(rel_to) try: self._files[path_to] = self._files.pop(path_from) except KeyError: raise NoSuchFile(rel_from) from None def stat(self, relpath): """Return a stat-like object for *relpath*.""" path = self._abspath(relpath) if path in self._dirs: return _MemoryStat(0, is_dir=True) if path in self._files: return _MemoryStat(len(self._files[path])) raise NoSuchFile(relpath) def iter_files_recursive(self): """Yield relative paths of all files below this transport.""" prefix = self._path().rstrip("/") + "/" for path in sorted(self._files): if path.startswith(prefix): yield path[len(prefix) :] def ensure_base(self): """Ensure the base directory exists.""" self._dirs.add(self._path()) def recommended_page_size(self): """Return a reasonable read-ahead size.""" return 4096 def _ensure_parent(self, relpath): """Ensure the parent directory of *relpath* exists.""" parent = posixpath.dirname(self._abspath(relpath)) self._dirs.add(parent) def __repr__(self): """Return string representation.""" return f"MemoryTransport({self.base!r})" class TracingTransport: """Transport wrapper that records operations in ``_activity``. Wraps another transport and delegates all calls. Selected operations are recorded as tuples in ``_activity`` for test assertions. The tuple format matches breezy's ``TransportTraceDecorator``.
""" def __init__(self, inner): """Initialize with the transport to wrap.""" self._inner = inner self._activity = [] def __getattr__(self, name): """Delegate everything not explicitly overridden to the inner transport.""" return getattr(self._inner, name) @property def base(self): """Return the base URL of the inner transport.""" return self._inner.base # -- traced methods (match breezy's TransportTraceDecorator format) -- def get(self, relpath): """Get file contents, recording the operation.""" self._activity.append(("get", relpath)) return self._inner.get(relpath) def get_bytes(self, relpath): """Get file bytes, recording the operation.""" self._activity.append(("get", relpath)) return self._inner.get_bytes(relpath) def put_bytes(self, relpath, raw_bytes, mode=None): """Put bytes, recording the operation.""" self._activity.append(("put_bytes", relpath, len(raw_bytes), mode)) return self._inner.put_bytes(relpath, raw_bytes, mode) def mkdir(self, relpath, mode=None): """Create a directory, recording the operation.""" self._activity.append(("mkdir", relpath, mode)) return self._inner.mkdir(relpath, mode) def readv(self, relpath, offsets, adjust_for_latency=False, upper_limit=None): """Read multiple ranges, recording the operation.""" self._activity.append( ("readv", relpath, list(offsets), adjust_for_latency, upper_limit) ) return self._inner.readv( relpath, offsets, adjust_for_latency=adjust_for_latency, upper_limit=upper_limit, ) # -- non-traced pass-through for common methods -- def put_file(self, relpath, f, mode=None): """Write a file to the inner transport.""" return self._inner.put_file(relpath, f, mode) def clone(self, relpath=None): """Clone this tracing transport.""" return TracingTransport(self._inner.clone(relpath)) def recommended_page_size(self): """Return the recommended page size from the inner transport.""" return self._inner.recommended_page_size() def __repr__(self): """Return string representation.""" return f"TracingTransport({self._inner!r})" 
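# Illustrative sketch only, not part of the module above. It restates the
# expand-then-merge idea described for ``_sort_expand_and_combine`` as a
# standalone function so the effect on concrete ranges can be seen. The name
# ``expand_and_merge`` is hypothetical, and the sketch assumes ``upper_limit``
# is always given and positive (the module's helper also accepts a falsy
# limit, which this sketch does not model).

```python
def expand_and_merge(offsets, upper_limit, page_size):
    """Widen each (offset, length) range to page_size, then merge overlaps."""
    expanded = []
    for offset, length in sorted(offsets):
        # Grow the range to at least page_size bytes, centered on the original.
        expansion = max(0, page_size - length)
        new_offset = max(0, offset - expansion // 2)
        # Clamp the end of the range to the end of the file.
        new_end = min(upper_limit, new_offset + length + expansion)
        if new_end > new_offset:
            expanded.append((new_offset, new_end - new_offset))

    merged = []
    for offset, length in expanded:
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Overlaps or touches the previous range: extend it if needed.
            prev_offset, prev_length = merged[-1]
            end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, end - prev_offset)
        else:
            merged.append((offset, length))
    return merged


# Two nearby 10-byte reads collapse into one request; the distant one stays
# separate.
ranges = expand_and_merge(
    [(100, 10), (5000, 10), (120, 10)], upper_limit=10000, page_size=64
)
print(ranges)
```

# With a 64-byte page size, the reads at offsets 100 and 120 become one
# combined range, which is the round-trip saving that ``readv`` aims for when
# ``adjust_for_latency`` is set.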
bzrformats_3.4.0.orig/bzrformats/tuned_gzip.py0000644000000000000000000000520415162074037016577 0ustar00# Copyright (C) 2006-2011 Canonical Ltd # Written by Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Legacy bazaar specific gzip tunings.""" import struct import zlib __all__ = ["chunks_to_gzip"] def U32(i): """Return i as an unsigned integer, assuming it fits in 32 bits. If it's >= 2GB when viewed as a 32-bit unsigned int, return a long. """ if i < 0: i += 1 << 32 return i def LOWU32(i): """Return the low-order 32 bits of an int, as a non-negative int.""" return i & 0xFFFFFFFF def chunks_to_gzip( chunks, factory=zlib.compressobj, level=zlib.Z_DEFAULT_COMPRESSION, method=zlib.DEFLATED, width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL, crc32=zlib.crc32, ): """Create a gzip file containing chunks and return its content. :param chunks: An iterable of bytestrings. Each bytestring can have arbitrary layout.
""" result = [ b"\037\213" # self.fileobj.write('\037\213') # magic header b"\010" # self.fileobj.write('\010') # compression method # fname = self.filename[:-3] # flags = 0 # if fname: # flags = FNAME b"\x00" # self.fileobj.write(chr(flags)) b"\0\0\0\0" # write32u(self.fileobj, long(time.time())) b"\002" # self.fileobj.write('\002') b"\377" # self.fileobj.write('\377') # if fname: b"" # self.fileobj.write(fname + '\000') ] # using a compressobj avoids a small header and trailer that the compress() # utility function adds. compress = factory(level, method, width, mem, 0) crc = 0 total_len = 0 for chunk in chunks: crc = crc32(chunk, crc) total_len += len(chunk) zbytes = compress.compress(chunk) if zbytes: result.append(zbytes) result.append(compress.flush()) # size may exceed 2GB, or even 4GB result.append(struct.pack(" None: """Create a ContentFactory.""" self.sha1: bytes | None = None self.size: int | None = None self.storage_kind: str | None = None self.key: tuple[bytes, ...] | None = None self.parents = None def map_key(self, cb): """Add prefix to all keys.""" if self.key is not None: self.key = cb(self.key) if self.parents is not None: self.parents = tuple([cb(parent) for parent in self.parents]) return self class FileContentFactory(ContentFactory): """File-based content factory.""" def __init__(self, key, parents, fileobj, sha1=None, size=None): """Initialize a FileContentFactory. Args: key: Unique identifier for this content. parents: Parent keys for this content. fileobj: File-like object containing the content data. sha1: SHA1 hash of the content (optional). size: Size of the content in bytes (optional). """ self.key = key self.parents = parents self.file = fileobj self.storage_kind = "file" self.sha1 = sha1 self.size = size self._needs_reset = False def get_bytes_as(self, storage_kind): """Get the content bytes in the specified storage format. Args: storage_kind: The desired storage format ('fulltext', 'chunked', 'lines'). 
Returns: bytes or list: The content data in the requested format. Raises: UnavailableRepresentation: If the requested storage kind is not supported. """ if self._needs_reset: self.file.seek(0) self._needs_reset = True if storage_kind == "fulltext": return self.file.read() elif storage_kind == "chunked": return list(file_iterator(self.file)) elif storage_kind == "lines": return list(self.file.readlines()) raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) def iter_bytes_as(self, storage_kind): """Iterate over content bytes in the specified storage format. Args: storage_kind: The desired storage format ('chunked', 'lines'). Returns: iterator: Iterator over the content data in the requested format. Raises: UnavailableRepresentation: If the requested storage kind is not supported. """ if self._needs_reset: self.file.seek(0) self._needs_reset = True if storage_kind == "chunked": return osutils.file_iterator(self.file) elif storage_kind == "lines": return self.file raise UnavailableRepresentation(self.key, storage_kind, self.storage_kind) class AdapterFactory(ContentFactory): """A content factory to adapt between key prefixes.""" def __init__(self, key, parents, adapted): """Create an adapter factory instance.""" self.key = key self.parents = parents self._adapted = adapted def __getattr__(self, attr): """Return a member from the adapted object.""" if attr in ("key", "parents"): return self.__dict__[attr] else: return getattr(self._adapted, attr) def filter_absent(record_stream): """Adapt a record stream to remove absent records.""" for record in record_stream: if record.storage_kind != "absent": yield record class _MPDiffGenerator: """Pull out the functionality for generating mp_diffs.""" def __init__(self, vf, keys): self.vf = vf # This is the order the keys were requested in self.ordered_keys = tuple(keys) # keys + their parents, what we need to compute the diffs self.needed_keys = () # Map from key: mp_diff self.diffs = {} # Map from key: 
parents_needed (may have ghosts) self.parent_map = {} # Parents that aren't present self.ghost_parents = () # Map from parent_key => number of children for this text self.refcounts = {} # Content chunks that are cached while we still need them self.chunks = {} def _find_needed_keys(self): """Find the set of keys we need to request. This includes all the original keys passed in, and the non-ghost parents of those keys. :return: (needed_keys, refcounts) needed_keys is the set of all texts we need to extract refcounts is a dict of {key: num_children} letting us know when we no longer need to cache a given parent text """ # All the keys and their parents needed_keys = set(self.ordered_keys) parent_map = self.vf.get_parent_map(needed_keys) self.parent_map = parent_map # TODO: Should we be using a different construct here? I think this # uses difference_update internally, and we expect the result to # be tiny missing_keys = needed_keys.difference(parent_map) if missing_keys: raise RevisionNotPresent(list(missing_keys)[0], self.vf) # Parents that might be missing. They are allowed to be ghosts, but we # should check for them refcounts = {} setdefault = refcounts.setdefault just_parents = set() for _child_key, parent_keys in parent_map.items(): if not parent_keys: # parent_keys may be None if a given VersionedFile claims to # not support graph operations. 
continue just_parents.update(parent_keys) needed_keys.update(parent_keys) for p in parent_keys: refcounts[p] = setdefault(p, 0) + 1 just_parents.difference_update(parent_map) # Remove any parents that are actually ghosts from the needed set self.present_parents = set(self.vf.get_parent_map(just_parents)) self.ghost_parents = just_parents.difference(self.present_parents) needed_keys.difference_update(self.ghost_parents) self.needed_keys = needed_keys self.refcounts = refcounts return needed_keys, refcounts def _compute_diff(self, key, parent_lines, lines): """Compute a single mp_diff, and store it in self.diffs.""" if len(parent_lines) > 0: # XXX: _extract_blocks is not usefully defined anywhere... # It was meant to extract the left-parent diff without # having to recompute it for Knit content (pack-0.92, # etc). That seems to have regressed somewhere left_parent_blocks = self.vf._extract_blocks(key, parent_lines[0], lines) else: left_parent_blocks = None diff = multiparent.MultiParent.from_lines( lines, parent_lines, left_parent_blocks ) self.diffs[key] = diff def _process_one_record(self, key, this_chunks): parent_keys = None if key in self.parent_map: # This record should be ready to diff, since we requested # content in 'topological' order parent_keys = self.parent_map.pop(key) # If a VersionedFile claims 'no-graph' support, then it may return # None for any parent request, so we replace it with an empty tuple if parent_keys is None: parent_keys = () parent_lines = [] for p in parent_keys: # Alternatively we could check p not in self.needed_keys, but # ghost_parents should be tiny versus huge if p in self.ghost_parents: continue refcount = self.refcounts[p] if refcount == 1: # Last child reference self.refcounts.pop(p) parent_chunks = self.chunks.pop(p) else: self.refcounts[p] = refcount - 1 parent_chunks = self.chunks[p] p_lines = osutils.chunks_to_lines(parent_chunks) # TODO: Should we cache the line form? 
We did the # computation to get it, but storing it this way will # be less memory efficient... parent_lines.append(p_lines) del p_lines lines = osutils.chunks_to_lines(this_chunks) # Since we needed the lines, we'll go ahead and cache them this way this_chunks = lines self._compute_diff(key, parent_lines, lines) del lines # Is this content required for any more children? if key in self.refcounts: self.chunks[key] = this_chunks def _extract_diffs(self): needed_keys, _refcounts = self._find_needed_keys() for record in self.vf.get_record_stream(needed_keys, "topological", True): if record.storage_kind == "absent": raise RevisionNotPresent(record.key, self.vf) self._process_one_record(record.key, record.get_bytes_as("chunked")) def compute_diffs(self): self._extract_diffs() dpop = self.diffs.pop return [dpop(k) for k in self.ordered_keys] class VersionedFile: """Versioned text file storage. A versioned file manages versions of line-based text files, keeping track of the originating version for each line. To clients the "lines" of the file are represented as a list of strings. These strings will typically have terminal newline characters, but this is not required. In particular files commonly do not have a newline at the end of the file. Texts are identified by a version-id string. """ @staticmethod def check_not_reserved_id(version_id): """Check that a version ID is not a reserved identifier. Args: version_id: The version ID to check, or None. Raises: ValueError: If version_id is a reserved identifier. """ if version_id is not None: revision.check_not_reserved_id(version_id) def copy_to(self, name, transport): """Copy this versioned file to name on transport.""" raise NotImplementedError(self.copy_to) def get_record_stream(self, versions, ordering, include_delta_closure): """Get a stream of records for versions. :param versions: The versions to include. Each version is a tuple (version,). :param ordering: Either 'unordered' or 'topological'. 
A topologically sorted stream has compression parents strictly before their children. :param include_delta_closure: If True then the closure across any compression parents will be included (in the data content of the stream, not in the emitted records). This guarantees that 'fulltext' can be used successfully on every record. :return: An iterator of ContentFactory objects, each of which is only valid until the iterator is advanced. """ raise NotImplementedError(self.get_record_stream) def has_version(self, version_id): """Returns whether version is present.""" raise NotImplementedError(self.has_version) def insert_record_stream(self, stream): """Insert a record stream into this versioned file. :param stream: A stream of records to insert. :return: None :seealso VersionedFile.get_record_stream: """ raise NotImplementedError def add_lines( self, version_id, parents, lines, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, check_content=True, ): r"""Add a single text on top of the versioned file. Must raise RevisionAlreadyPresent if the new version is already present in file history. Must raise RevisionNotPresent if any of the given parents are not present in file history. :param lines: A list of lines. Each line must be a bytestring. And all of them except the last must be terminated with \n and contain no other \n's. The last line may either contain no \n's or a single terminating \n. If the lines list does not meet this constraint the add routine may error or may succeed - but you will be unable to read the data back accurately. (Checking the lines have been split correctly is expensive and extremely unlikely to catch bugs so it is not done at runtime unless check_content is True.) :param parent_texts: An optional dictionary containing the opaque representations of some or all of the parents of version_id to allow delta optimisations. VERY IMPORTANT: the texts must be those returned by add_lines or data corruption can be caused. 
:param left_matching_blocks: a hint about which areas are common between the text and its left-hand-parent. The format is the SequenceMatcher.get_matching_blocks format. :param nostore_sha: Raise ExistingContent and do not add the lines to the versioned file if the digest of the lines matches this. :param random_id: If True a random id has been selected rather than an id determined by some deterministic process such as a converter from a foreign VCS. When True the backend may choose not to check for uniqueness of the resulting key within the versioned file, so this should only be done when the result is expected to be unique anyway. :param check_content: If True, the lines supplied are verified to be bytestrings that are correctly formed lines. :return: The text sha1, the number of bytes in the text, and an opaque representation of the inserted version which can be provided back to future add_lines calls in the parent_texts dictionary. """ self._check_write_ok() return self._add_lines( version_id, parents, lines, parent_texts, left_matching_blocks, nostore_sha, random_id, check_content, ) def _add_lines( self, version_id, parents, lines, parent_texts, left_matching_blocks, nostore_sha, random_id, check_content, ): """Helper to do the class specific add_lines.""" raise NotImplementedError(self.add_lines) def add_lines_with_ghosts( self, version_id, parents, lines, parent_texts=None, nostore_sha=None, random_id=False, check_content=True, left_matching_blocks=None, ): """Add lines to the versioned file, allowing ghosts to be present. This takes the same parameters as add_lines and returns the same. 
""" self._check_write_ok() return self._add_lines_with_ghosts( version_id, parents, lines, parent_texts, nostore_sha, random_id, check_content, left_matching_blocks, ) def _add_lines_with_ghosts( self, version_id, parents, lines, parent_texts, nostore_sha, random_id, check_content, left_matching_blocks, ): """Helper to do class specific add_lines_with_ghosts.""" raise NotImplementedError(self.add_lines_with_ghosts) def check(self, progress_bar=None): """Check the versioned file for integrity.""" raise NotImplementedError(self.check) def _check_lines_not_unicode(self, lines): """Check that lines being added to a versioned file are not unicode.""" for line in lines: if not isinstance(line, bytes): raise TypeError("lines") def _check_lines_are_lines(self, lines): """Check that the lines really are full lines without inline EOL.""" for line in lines: if b"\n" in line[:-1]: raise ValueError("lines contain newlines") def get_format_signature(self): """Get a text description of the data encoding in this file. :since: 0.90 """ raise NotImplementedError(self.get_format_signature) def make_mpdiffs(self, version_ids): """Create multiparent diffs for specified versions.""" # XXX: Can't use _MPDiffGenerator just yet. This is because version_ids # is a list of strings, not keys. And while self.get_record_stream # is supported, it takes *keys*, while self.get_parent_map() takes # strings... *sigh* knit_versions = set() knit_versions.update(version_ids) parent_map = self.get_parent_map(version_ids) for version_id in version_ids: try: knit_versions.update(parent_map[version_id]) except KeyError as e: raise RevisionNotPresent(version_id, self) from e # We need to filter out ghosts, because we can't diff against them. 
knit_versions = set(self.get_parent_map(knit_versions)) lines = dict( zip( knit_versions, self._get_lf_split_line_list(knit_versions), strict=False ) ) diffs = [] for version_id in version_ids: target = lines[version_id] try: parents = [ lines[p] for p in parent_map[version_id] if p in knit_versions ] except KeyError as e: # I don't know how this could ever trigger. # parent_map[version_id] was already triggered in the previous # for loop, and lines[p] has the 'if p in knit_versions' check, # so we again won't have a KeyError. raise RevisionNotPresent(version_id, self) from e if len(parents) > 0: left_parent_blocks = self._extract_blocks( version_id, parents[0], target ) else: left_parent_blocks = None diffs.append( multiparent.MultiParent.from_lines(target, parents, left_parent_blocks) ) return diffs def _extract_blocks(self, version_id, source, target): return None def add_mpdiffs(self, records): """Add mpdiffs to this VersionedFile. Records should be iterables of version, parents, expected_sha1, mpdiff. mpdiff should be a MultiParent instance. """ # Does this need to call self._check_write_ok()? 
        # (IanC 20070919)
        vf_parents = {}
        mpvf = multiparent.MultiMemoryVersionedFile()
        versions = []
        for version, parent_ids, _expected_sha1, mpdiff in records:
            versions.append(version)
            mpvf.add_diff(mpdiff, version, parent_ids)
        needed_parents = set()
        for _version, parent_ids, _expected_sha1, _mpdiff in records:
            needed_parents.update(p for p in parent_ids if not mpvf.has_version(p))
        present_parents = set(self.get_parent_map(needed_parents))
        for parent_id, lines in zip(
            present_parents, self._get_lf_split_line_list(present_parents), strict=False
        ):
            mpvf.add_version(lines, parent_id, [])
        for (version, parent_ids, _expected_sha1, mpdiff), lines in zip(
            records, mpvf.get_line_list(versions), strict=False
        ):
            if len(parent_ids) == 1:
                left_matching_blocks = list(
                    mpdiff.get_matching_blocks(
                        0, mpvf.get_diff(parent_ids[0]).num_lines()
                    )
                )
            else:
                left_matching_blocks = None
            try:
                _, _, version_text = self.add_lines_with_ghosts(
                    version,
                    parent_ids,
                    lines,
                    vf_parents,
                    left_matching_blocks=left_matching_blocks,
                )
            except NotImplementedError:
                # The vf can't handle ghosts, so add lines normally, which will
                # (reasonably) fail if there are ghosts in the data.
                _, _, version_text = self.add_lines(
                    version,
                    parent_ids,
                    lines,
                    vf_parents,
                    left_matching_blocks=left_matching_blocks,
                )
            vf_parents[version] = version_text
        sha1s = self.get_sha1s(versions)
        for version, _parent_ids, expected_sha1, _mpdiff in records:
            if expected_sha1 != sha1s[version]:
                raise VersionedFileInvalidChecksum(version)

    def get_text(self, version_id):
        """Return version contents as a text string.

        Raises RevisionNotPresent if version is not present in
        file history.
        """
        return b"".join(self.get_lines(version_id))

    get_string = get_text

    def get_texts(self, version_ids):
        """Return the texts of listed versions as a list of strings.

        Raises RevisionNotPresent if version is not present in
        file history.
        """
        return [b"".join(self.get_lines(v)) for v in version_ids]

    def get_lines(self, version_id):
        """Return version contents as a sequence of lines.

        Raises RevisionNotPresent if version is not present in
        file history.
        """
        raise NotImplementedError(self.get_lines)

    def _get_lf_split_line_list(self, version_ids):
        return [BytesIO(t).readlines() for t in self.get_texts(version_ids)]

    def get_ancestry(self, version_ids):
        """Return a list of all ancestors of given version(s).

        This will not include the null revision.

        Must raise RevisionNotPresent if any of the given versions are
        not present in file history.
        """
        raise NotImplementedError(self.get_ancestry)

    def get_ancestry_with_ghosts(self, version_ids):
        """Return a list of all ancestors of given version(s).

        This will not include the null revision.

        Must raise RevisionNotPresent if any of the given versions are
        not present in file history.

        Ghosts that are known about will be included in ancestry list,
        but are not explicitly marked.
        """
        raise NotImplementedError(self.get_ancestry_with_ghosts)

    def get_parent_map(self, version_ids):
        """Get a map of the parents of version_ids.

        :param version_ids: The version ids to look up parents for.
        :return: A mapping from version id to parents.
        """
        raise NotImplementedError(self.get_parent_map)

    def get_parents_with_ghosts(self, version_id):
        """Return version names for parents of version_id.

        Will raise RevisionNotPresent if version_id is not present
        in the history.

        Ghosts that are known about will be included in the parent list,
        but are not explicitly marked.
        """
        try:
            return list(self.get_parent_map([version_id])[version_id])
        except KeyError as e:
            raise RevisionNotPresent(version_id, self) from e

    def annotate(self, version_id):
        """Return a list of (version-id, line) tuples for version_id.

        :raise RevisionNotPresent: If the given version is
            not present in file history.
        """
        raise NotImplementedError(self.annotate)

    def iter_lines_added_or_present_in_versions(self, version_ids=None, pb=None):
        r"""Iterate over the lines in the versioned file from version_ids.

        This may return lines from other versions. Each item the returned
        iterator yields is a tuple of a line and a text version that that line
        is present in (not introduced in).

        Ordering of results is in whatever order is most suitable for the
        underlying storage format.

        If a progress bar is supplied, it may be used to indicate progress.
        The caller is responsible for cleaning up progress bars (because this
        is an iterator).

        NOTES: Lines are normalised: they will all have \n terminators.
               Lines are returned in arbitrary order.

        :return: An iterator over (line, version_id).
        """
        raise NotImplementedError(self.iter_lines_added_or_present_in_versions)

    def plan_merge(self, ver_a, ver_b, base=None):
        """Return pseudo-annotation indicating how the two versions merge.

        This is computed between versions a and b and their common
        base.

        Weave lines present in none of them are skipped entirely.

        Legend:
        killed-base Dead in base revision
        killed-both Killed in each revision
        killed-a    Killed in a
        killed-b    Killed in b
        unchanged   Alive in both a and b (possibly created in both)
        new-a       Created in a
        new-b       Created in b
        ghost-a     Killed in a, unborn in b
        ghost-b     Killed in b, unborn in a
        irrelevant  Not in either revision
        """
        raise NotImplementedError(VersionedFile.plan_merge)

    def weave_merge(
        self, plan, a_marker=TextMerge.A_MARKER, b_marker=TextMerge.B_MARKER
    ):
        """Merge text using a weave merge algorithm.

        Args:
            plan: The merge plan to execute.
            a_marker: Marker for 'A' side conflicts (optional).
            b_marker: Marker for 'B' side conflicts (optional).

        Returns:
            list: Merged lines of text.
        """
        return PlanWeaveMerge(plan, a_marker, b_marker).merge_lines()[0]


class RecordingVersionedFilesDecorator:
    """A minimal versioned files that records calls made on it.

    Only enough methods have been added to support tests using it to date.
    :ivar calls: A list of the calls made; can be reset at any time by
        assigning [] to it.
    """

    def __init__(self, backing_vf):
        """Create a RecordingVersionedFilesDecorator decorating backing_vf.

        :param backing_vf: The versioned file to answer all methods.
        """
        self._backing_vf = backing_vf
        self.calls = []

    def add_lines(
        self,
        key,
        parents,
        lines,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        """Add lines to the versioned file and record the call.

        Args:
            key: The key for the new version.
            parents: Parent keys for the new version.
            lines: The text lines to add.
            parent_texts: Parent text data (optional).
            left_matching_blocks: Matching blocks for delta compression (optional).
            nostore_sha: SHA to skip storing if duplicate (optional).
            random_id: Whether to use a random ID (optional).
            check_content: Whether to validate content (optional).

        Returns:
            The result from the backing versioned file.
        """
        self.calls.append(
            (
                "add_lines",
                key,
                parents,
                lines,
                parent_texts,
                left_matching_blocks,
                nostore_sha,
                random_id,
                check_content,
            )
        )
        return self._backing_vf.add_lines(
            key,
            parents,
            lines,
            parent_texts,
            left_matching_blocks,
            nostore_sha,
            random_id,
            check_content,
        )

    def add_content(
        self,
        factory,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        """Add content from a factory and record the call.

        Args:
            factory: ContentFactory providing the content.
            parent_texts: Parent text data (optional).
            left_matching_blocks: Matching blocks for delta compression (optional).
            nostore_sha: SHA to skip storing if duplicate (optional).
            random_id: Whether to use a random ID (optional).
            check_content: Whether to validate content (optional).

        Returns:
            The result from the backing versioned file.
        """
        self.calls.append(
            (
                "add_content",
                factory,
                parent_texts,
                left_matching_blocks,
                nostore_sha,
                random_id,
                check_content,
            )
        )
        return self._backing_vf.add_content(
            factory,
            parent_texts,
            left_matching_blocks,
            nostore_sha,
            random_id,
            check_content,
        )

    def check(self):
        """Check the backing versioned file for consistency."""
        self._backing_vf.check()

    def get_parent_map(self, keys):
        """Get parent mapping for keys and record the call.

        Args:
            keys: Keys to get parent mapping for.

        Returns:
            dict: Mapping of keys to their parents.
        """
        self.calls.append(("get_parent_map", copy(keys)))
        return self._backing_vf.get_parent_map(keys)

    def get_record_stream(self, keys, sort_order, include_delta_closure):
        """Get a stream of records and record the call.

        Args:
            keys: Keys to get records for.
            sort_order: How to sort the results.
            include_delta_closure: Whether to include delta closure.

        Returns:
            Iterator over record data.
        """
        self.calls.append(
            ("get_record_stream", list(keys), sort_order, include_delta_closure)
        )
        return self._backing_vf.get_record_stream(
            keys, sort_order, include_delta_closure
        )

    def get_sha1s(self, keys):
        """Get SHA1 hashes for keys and record the call.

        Args:
            keys: Keys to get SHA1s for.

        Returns:
            dict: Mapping of keys to their SHA1 hashes.
        """
        self.calls.append(("get_sha1s", copy(keys)))
        return self._backing_vf.get_sha1s(keys)

    def iter_lines_added_or_present_in_keys(self, keys, pb=None):
        """Iterate over lines added or present in keys and record the call.

        Args:
            keys: Keys to iterate over.
            pb: Optional progress bar.

        Returns:
            Iterator over lines.
        """
        self.calls.append(("iter_lines_added_or_present_in_keys", copy(keys)))
        return self._backing_vf.iter_lines_added_or_present_in_keys(keys, pb=pb)

    def keys(self):
        """Get all keys and record the call.

        Returns:
            Iterable of all keys in the versioned file.
        """
        self.calls.append(("keys",))
        return self._backing_vf.keys()


class OrderingVersionedFilesDecorator(RecordingVersionedFilesDecorator):
    """A VF that records calls, and returns keys in specific order.

    :ivar calls: A list of the calls made; can be reset at any time by
        assigning [] to it.
    """

    def __init__(self, backing_vf, key_priority):
        """Create an OrderingVersionedFilesDecorator decorating backing_vf.

        :param backing_vf: The versioned file to answer all methods.
        :param key_priority: A dictionary defining what order keys should be
            returned from an 'unordered' get_record_stream request.
            Keys with lower priority are returned first, keys not present in
            the map get an implicit priority of 0, and are returned in
            lexicographical order.
        """
        RecordingVersionedFilesDecorator.__init__(self, backing_vf)
        self._key_priority = key_priority

    def get_record_stream(self, keys, sort_order, include_delta_closure):
        """Get a stream of records with custom ordering and record the call.

        Args:
            keys: Keys to get records for.
            sort_order: How to sort the results ('unordered' uses key_priority).
            include_delta_closure: Whether to include delta closure.

        Yields:
            Record data in the specified order.
        """
        self.calls.append(
            ("get_record_stream", list(keys), sort_order, include_delta_closure)
        )
        if sort_order == "unordered":

            def sort_key(key):
                return (self._key_priority.get(key, 0), key)

            # Use a defined order by asking for the keys one-by-one from the
            # backing_vf
            for key in sorted(keys, key=sort_key):
                yield from self._backing_vf.get_record_stream(
                    [key], "unordered", include_delta_closure
                )
        else:
            yield from self._backing_vf.get_record_stream(
                keys, sort_order, include_delta_closure
            )


class KeyMapper:
    """KeyMappers map between keys and underlying partitioned storage."""

    def map(self, key):
        """Map key to an underlying storage identifier.

        :param key: A key tuple e.g. (b'file-id', b'revision-id').
        :return: An underlying storage identifier, specific to the partitioning
            mechanism.
        """
        raise NotImplementedError(self.map)

    def unmap(self, partition_id):
        """Map a partitioned storage id back to a key prefix.

        :param partition_id: The underlying partition id.
        :return: As much of a key (or prefix) as is derivable from the partition
            id.
        """
        raise NotImplementedError(self.unmap)


class ConstantMapper(KeyMapper):
    """A key mapper that maps to a constant result."""

    def __init__(self, result):
        """Create a ConstantMapper which will return result for all maps."""
        self._result = result

    def map(self, key):
        """See KeyMapper.map()."""
        return self._result


class URLEscapeMapper(KeyMapper):
    """Base class for use with transport backed storage.

    This provides a map and unmap wrapper that respectively url escape and
    unescape their outputs and inputs.
    """

    def map(self, key):
        """See KeyMapper.map()."""
        return quote(self._map(key))

    def unmap(self, partition_id):
        """See KeyMapper.unmap()."""
        return self._unmap(unquote(partition_id))


class PrefixMapper(URLEscapeMapper):
    """A key mapper that extracts the first component of a key.

    This mapper is for use with a transport based backend.
    """

    def _map(self, key):
        """See KeyMapper.map()."""
        return key[0].decode("utf-8")

    def _unmap(self, partition_id):
        """See KeyMapper.unmap()."""
        return (partition_id.encode("utf-8"),)


class HashPrefixMapper(URLEscapeMapper):
    """A key mapper that combines the first component of a key with a hash.

    This mapper is for use with a transport based backend.
    """

    def _map(self, key):
        """See KeyMapper.map()."""
        prefix = self._escape(key[0])
        return f"{adler32(prefix) & 255:02x}/{prefix.decode('utf-8')}"

    def _escape(self, prefix):
        """No escaping needed here."""
        return prefix

    def _unmap(self, partition_id):
        """See KeyMapper.unmap()."""
        return (self._unescape(osutils.basename(partition_id)).encode("utf-8"),)

    def _unescape(self, basename):
        """No unescaping needed for HashPrefixMapper."""
        return basename


class HashEscapedPrefixMapper(HashPrefixMapper):
    """Combines the escaped first component of a key with a hash.
    This mapper is for use with a transport based backend.
    """

    _safe = bytearray(b"abcdefghijklmnopqrstuvwxyz0123456789-_@,.")

    def _escape(self, prefix):
        """Turn a key element into a filesystem safe string.

        This is similar to a plain urllib.parse.quote, except
        it uses specific safe characters, so that it doesn't
        have to translate a lot of valid file ids.
        """
        # @ does not get escaped. This is because it is a valid
        # filesystem character we use all the time, and it looks
        # a lot better than seeing %40 all the time.
        r = [((c in self._safe) and chr(c)) or (f"%{c:02x}") for c in bytearray(prefix)]
        return "".join(r).encode("ascii")

    def _unescape(self, basename):
        """Escaped names are easily unescaped by urllib.parse.unquote."""
        return unquote(basename)


def make_versioned_files_factory(versioned_file_factory, mapper):
    """Create a ThunkedVersionedFiles factory.

    This will create a callable which when called creates a
    ThunkedVersionedFiles on a transport, using mapper to access individual
    versioned files, and versioned_file_factory to create each individual file.
    """

    def factory(transport):
        return ThunkedVersionedFiles(
            transport, versioned_file_factory, mapper, lambda: True
        )

    return factory


class VersionedFiles:
    """Storage for many versioned files.

    This object allows a single keyspace for accessing the history graph and
    contents of named bytestrings.

    Currently no implementation allows the graph of different key prefixes to
    intersect, but the API does allow such implementations in the future.

    The keyspace is expressed via simple tuples. Any instance of VersionedFiles
    may have a different length key-size, but that size will be constant for
    all texts added to or retrieved from it. For instance, bazaar uses
    instances with a key-size of 2 for storing user files in a repository, with
    the first element the fileid, and the second the version of that file.
    The use of tuples allows a single code base to support several different
    uses with only the mapping logic changing from instance to instance.

    :ivar _immediate_fallback_vfs: For subclasses that support stacking,
        this is a list of other VersionedFiles immediately underneath this
        one. They may in turn each have further fallbacks.
    """

    def add_lines(
        self,
        key,
        parents,
        lines,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        r"""Add a text to the store.

        :param key: The key tuple of the text to add. If the last element is
            None, a CHK string will be generated during the addition.
        :param parents: The parents key tuples of the text to add.
        :param lines: A list of lines. Each line must be a bytestring. And all
            of them except the last must be terminated with \n and contain no
            other \n's. The last line may either contain no \n's or a single
            terminating \n. If the lines list does not meet this constraint
            the add routine may error or may succeed - but you will be unable
            to read the data back accurately. (Checking the lines have been
            split correctly is expensive and extremely unlikely to catch bugs
            so it is not done at runtime unless check_content is True.)
        :param parent_texts: An optional dictionary containing the opaque
            representations of some or all of the parents of version_id to
            allow delta optimisations. VERY IMPORTANT: the texts must be those
            returned by add_lines or data corruption can be caused.
        :param left_matching_blocks: a hint about which areas are common
            between the text and its left-hand-parent. The format is
            the SequenceMatcher.get_matching_blocks format.
        :param nostore_sha: Raise ExistingContent and do not add the lines to
            the versioned file if the digest of the lines matches this.
        :param random_id: If True a random id has been selected rather than
            an id determined by some deterministic process such as a converter
            from a foreign VCS.
            When True the backend may choose not to check for
            uniqueness of the resulting key within the versioned file, so
            this should only be done when the result is expected to be unique
            anyway.
        :param check_content: If True, the lines supplied are verified to be
            bytestrings that are correctly formed lines.
        :return: The text sha1, the number of bytes in the text, and an opaque
            representation of the inserted version which can be provided
            back to future add_lines calls in the parent_texts dictionary.
        """
        raise NotImplementedError(self.add_lines)

    def add_content(
        self,
        factory,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        """Add a text to the store from a content factory.

        :param factory: A ContentFactory supplying the key, parents and
            content of the text to add. If the last element of the key is
            None, a CHK string will be generated during the addition.
        :param parent_texts: An optional dictionary containing the opaque
            representations of some or all of the parents of version_id to
            allow delta optimisations. VERY IMPORTANT: the texts must be those
            returned by add_lines or data corruption can be caused.
        :param left_matching_blocks: a hint about which areas are common
            between the text and its left-hand-parent. The format is
            the SequenceMatcher.get_matching_blocks format.
        :param nostore_sha: Raise ExistingContent and do not add the lines to
            the versioned file if the digest of the lines matches this.
        :param random_id: If True a random id has been selected rather than
            an id determined by some deterministic process such as a converter
            from a foreign VCS. When True the backend may choose not to check
            for uniqueness of the resulting key within the versioned file, so
            this should only be done when the result is expected to be unique
            anyway.
        :param check_content: If True, the lines supplied are verified to be
            bytestrings that are correctly formed lines.
        :return: The text sha1, the number of bytes in the text, and an opaque
            representation of the inserted version which can be provided
            back to future add_lines calls in the parent_texts dictionary.
        """
        raise NotImplementedError(self.add_content)

    def add_mpdiffs(self, records):
        """Add mpdiffs to this VersionedFile.

        Records should be iterables of version, parents, expected_sha1,
        mpdiff. mpdiff should be a MultiParent instance.
        """
        vf_parents = {}
        mpvf = multiparent.MultiMemoryVersionedFile()
        versions = []
        for version, parent_ids, _expected_sha1, mpdiff in records:
            versions.append(version)
            mpvf.add_diff(mpdiff, version, parent_ids)
        needed_parents = set()
        for _version, parent_ids, _expected_sha1, _mpdiff in records:
            needed_parents.update(p for p in parent_ids if not mpvf.has_version(p))
        # It seems likely that adding all the present parents as fulltexts can
        # easily exhaust memory.
        for record in self.get_record_stream(needed_parents, "unordered", True):
            if record.storage_kind == "absent":
                continue
            mpvf.add_version(record.get_bytes_as("lines"), record.key, [])
        for (key, parent_keys, expected_sha1, mpdiff), lines in zip(
            records, mpvf.get_line_list(versions), strict=False
        ):
            if len(parent_keys) == 1:
                left_matching_blocks = list(
                    mpdiff.get_matching_blocks(
                        0, mpvf.get_diff(parent_keys[0]).num_lines()
                    )
                )
            else:
                left_matching_blocks = None
            version_sha1, _, version_text = self.add_lines(
                key,
                parent_keys,
                lines,
                vf_parents,
                left_matching_blocks=left_matching_blocks,
            )
            if version_sha1 != expected_sha1:
                raise VersionedFileInvalidChecksum(key)
            vf_parents[key] = version_text

    def annotate(self, key):
        """Return a list of (version-key, line) tuples for the text of key.

        :raise RevisionNotPresent: If the key is not present.
        """
        raise NotImplementedError(self.annotate)

    def check(self, progress_bar=None):
        """Check this object for integrity.

        :param progress_bar: A progress bar to output as the check progresses.
        :param keys: Specific keys within the VersionedFiles to check.
            When this parameter is not None, check() becomes a generator as
            per get_record_stream. The difference to get_record_stream is that
            more or deeper checks will be performed.
        :return: None, or if keys was supplied a generator as per
            get_record_stream.
        """
        raise NotImplementedError(self.check)

    @staticmethod
    def check_not_reserved_id(version_id):
        """Check that a version ID is not a reserved identifier.

        Args:
            version_id: The version ID to check, or None.

        Raises:
            ValueError: If version_id is a reserved identifier.
        """
        if version_id is not None:
            revision.check_not_reserved_id(version_id)

    def clear_cache(self):
        """Clear whatever caches this VersionedFile holds.

        This is generally called after an operation has been performed, when we
        don't expect to be using this versioned file again soon.
        """

    def _check_lines_not_unicode(self, lines):
        """Check that lines being added to a versioned file are not unicode."""
        for line in lines:
            if line.__class__ is not bytes:
                raise TypeError("lines")

    def _check_lines_are_lines(self, lines):
        """Check that the lines really are full lines without inline EOL."""
        for line in lines:
            if b"\n" in line[:-1]:
                raise ValueError("lines contain newlines")

    def get_known_graph_ancestry(self, keys):
        """Get a KnownGraph instance with the ancestry of keys."""
        # most basic implementation is a loop around get_parent_map
        pending = set(keys)
        parent_map = {}
        while pending:
            this_parent_map = self.get_parent_map(pending)
            parent_map.update(this_parent_map)
            pending = set(itertools.chain.from_iterable(this_parent_map.values()))
            pending.difference_update(parent_map)
        kg = _mod_known_graph.KnownGraph(parent_map)
        return kg

    def get_parent_map(self, keys):
        """Get a map of the parents of keys.

        :param keys: The keys to look up parents for.
        :return: A mapping from keys to parents. Absent keys are absent from
            the mapping.
        """
        raise NotImplementedError(self.get_parent_map)

    def get_record_stream(self, keys, ordering, include_delta_closure):
        """Get a stream of records for keys.
        :param keys: The keys to include.
        :param ordering: Either 'unordered' or 'topological'. A topologically
            sorted stream has compression parents strictly before their
            children.
        :param include_delta_closure: If True then the closure across any
            compression parents will be included (in the opaque data).
        :return: An iterator of ContentFactory objects, each of which is only
            valid until the iterator is advanced.
        """
        raise NotImplementedError(self.get_record_stream)

    def get_sha1s(self, keys):
        """Get the sha1's of the texts for the given keys.

        :param keys: The names of the keys to look up.
        :return: a dict from key to sha1 digest. Keys of texts which are not
            present in the store are not present in the returned
            dictionary.
        """
        raise NotImplementedError(self.get_sha1s)

    __contains__ = index._has_key_from_parent_map

    def get_missing_compression_parent_keys(self):
        """Return an iterable of keys of missing compression parents.

        Check this after calling insert_record_stream to find out if there are
        any missing compression parents. If there are, the records that
        depend on them are not able to be inserted safely. The precise
        behaviour depends on the concrete VersionedFiles class in use.

        Classes that do not support this will raise NotImplementedError.
        """
        raise NotImplementedError(self.get_missing_compression_parent_keys)

    def insert_record_stream(self, stream):
        """Insert a record stream into this container.

        :param stream: A stream of records to insert.
        :return: None
        :seealso VersionedFile.get_record_stream:
        """
        raise NotImplementedError

    def iter_lines_added_or_present_in_keys(self, keys, pb=None):
        r"""Iterate over the lines in the versioned files from keys.

        This may return lines from other keys. Each item the returned
        iterator yields is a tuple of a line and a text version that that line
        is present in (not introduced in).

        Ordering of results is in whatever order is most suitable for the
        underlying storage format.

        If a progress bar is supplied, it may be used to indicate progress.
        The caller is responsible for cleaning up progress bars (because this
        is an iterator).

        Notes:
         * Lines are normalised by the underlying store: they will all have \n
           terminators.
         * Lines are returned in arbitrary order.

        :return: An iterator over (line, key).
        """
        raise NotImplementedError(self.iter_lines_added_or_present_in_keys)

    def keys(self):
        """Return an iterable of the keys for all the contained texts."""
        raise NotImplementedError(self.keys)

    def make_mpdiffs(self, keys):
        """Create multiparent diffs for specified keys."""
        generator = _MPDiffGenerator(self, keys)
        return generator.compute_diffs()

    def get_annotator(self):
        """Get an annotator for this versioned file.

        Returns:
            VersionedFileAnnotator: An annotator instance for this versioned file.
        """
        from .annotate import VersionedFileAnnotator

        return VersionedFileAnnotator(self)

    missing_keys = index._missing_keys_from_parent_map

    def _extract_blocks(self, version_id, source, target):
        return None

    def _transitive_fallbacks(self):
        """Return the whole stack of fallback versionedfiles.

        This VersionedFiles may have a list of fallbacks, but it doesn't
        necessarily know about the whole stack going down, and it can't know
        at open time because they may change after the objects are opened.
        """
        all_fallbacks = []
        for a_vfs in self._immediate_fallback_vfs:
            all_fallbacks.append(a_vfs)
            all_fallbacks.extend(a_vfs._transitive_fallbacks())
        return all_fallbacks


class ThunkedVersionedFiles(VersionedFiles):
    """Storage for many versioned files thunked onto a 'VersionedFile' class.

    This object allows a single keyspace for accessing the history graph and
    contents of named bytestrings.

    Currently no implementation allows the graph of different key prefixes to
    intersect, but the API does allow such implementations in the future.
    """

    def __init__(self, transport, file_factory, mapper, is_locked):
        """Create a ThunkedVersionedFiles."""
        self._transport = transport
        self._file_factory = file_factory
        self._mapper = mapper
        self._is_locked = is_locked

    def add_content(
        self,
        factory,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        """See VersionedFiles.add_content()."""
        lines = factory.get_bytes_as("lines")
        return self.add_lines(
            factory.key,
            factory.parents,
            lines,
            parent_texts=parent_texts,
            left_matching_blocks=left_matching_blocks,
            nostore_sha=nostore_sha,
            random_id=random_id,
            check_content=check_content,
        )

    def add_lines(
        self,
        key,
        parents,
        lines,
        parent_texts=None,
        left_matching_blocks=None,
        nostore_sha=None,
        random_id=False,
        check_content=True,
    ):
        """See VersionedFiles.add_lines()."""
        path = self._mapper.map(key)
        version_id = key[-1]
        parents = [parent[-1] for parent in parents]
        vf = self._get_vf(path)
        try:
            try:
                return vf.add_lines_with_ghosts(
                    version_id,
                    parents,
                    lines,
                    parent_texts=parent_texts,
                    left_matching_blocks=left_matching_blocks,
                    nostore_sha=nostore_sha,
                    random_id=random_id,
                    check_content=check_content,
                )
            except NotImplementedError:
                return vf.add_lines(
                    version_id,
                    parents,
                    lines,
                    parent_texts=parent_texts,
                    left_matching_blocks=left_matching_blocks,
                    nostore_sha=nostore_sha,
                    random_id=random_id,
                    check_content=check_content,
                )
        except TransportNoSuchFile:
            # parent directory may be missing, try again.
            self._transport.mkdir(osutils.dirname(path))
            try:
                return vf.add_lines_with_ghosts(
                    version_id,
                    parents,
                    lines,
                    parent_texts=parent_texts,
                    left_matching_blocks=left_matching_blocks,
                    nostore_sha=nostore_sha,
                    random_id=random_id,
                    check_content=check_content,
                )
            except NotImplementedError:
                return vf.add_lines(
                    version_id,
                    parents,
                    lines,
                    parent_texts=parent_texts,
                    left_matching_blocks=left_matching_blocks,
                    nostore_sha=nostore_sha,
                    random_id=random_id,
                    check_content=check_content,
                )

    def annotate(self, key):
        """Return a list of (version-key, line) tuples for the text of key.

        :raise RevisionNotPresent: If the key is not present.
        """
        prefix = key[:-1]
        path = self._mapper.map(prefix)
        vf = self._get_vf(path)
        origins = vf.annotate(key[-1])
        result = []
        for origin, line in origins:
            result.append((prefix + (origin,), line))
        return result

    def check(self, progress_bar=None, keys=None):
        """See VersionedFiles.check()."""
        # XXX: This is over-enthusiastic but as we only thunk for Weaves today
        # this is tolerable. Ideally we'd pass keys down to check() and
        # have the older VersionedFile interface updated too.
        for _prefix, vf in self._iter_all_components():
            vf.check()
        if keys is not None:
            return self.get_record_stream(keys, "unordered", True)

    def get_parent_map(self, keys):
        """Get a map of the parents of keys.

        :param keys: The keys to look up parents for.
        :return: A mapping from keys to parents. Absent keys are absent from
            the mapping.
        """
        prefixes = self._partition_keys(keys)
        result = {}
        for prefix, suffixes in prefixes.items():
            path = self._mapper.map(prefix)
            vf = self._get_vf(path)
            parent_map = vf.get_parent_map(suffixes)
            for key, parents in parent_map.items():
                result[prefix + (key,)] = tuple(
                    prefix + (parent,) for parent in parents
                )
        return result

    def _get_vf(self, path):
        if not self._is_locked():
            raise ObjectNotLocked(self)
        return self._file_factory(
            path, self._transport, create=True, get_scope=lambda: None
        )

    def _partition_keys(self, keys):
        """Turn keys into a dict of prefix:suffix_list."""
        result = {}
        for key in keys:
            prefix_keys = result.setdefault(key[:-1], [])
            prefix_keys.append(key[-1])
        return result

    def _iter_all_prefixes(self):
        # Identify all key prefixes.
        # XXX: A bit hacky, needs polish.
        if isinstance(self._mapper, ConstantMapper):
            paths = [self._mapper.map(())]
            prefixes = [()]
        else:
            relpaths = set()
            for quoted_relpath in self._transport.iter_files_recursive():
                path, _ext = os.path.splitext(quoted_relpath)
                relpaths.add(path)
            paths = list(relpaths)
            prefixes = [self._mapper.unmap(path) for path in paths]
        return zip(paths, prefixes, strict=False)

    def get_record_stream(self, keys, ordering, include_delta_closure):
        """See VersionedFiles.get_record_stream()."""

        # Ordering will be taken care of by each partitioned store; group keys
        # by partition.
        def add_prefix(p, k):
            return p + k

        keys = sorted(keys)
        for prefix, suffixes, vf in self._iter_keys_vf(keys):
            suffixes = [(suffix,) for suffix in suffixes]
            for record in vf.get_record_stream(
                suffixes, ordering, include_delta_closure
            ):
                record.map_key(functools.partial(add_prefix, prefix))
                yield record

    def _iter_keys_vf(self, keys):
        prefixes = self._partition_keys(keys)
        for prefix, suffixes in prefixes.items():
            path = self._mapper.map(prefix)
            vf = self._get_vf(path)
            yield prefix, suffixes, vf

    def get_sha1s(self, keys):
        """See VersionedFiles.get_sha1s()."""
        sha1s = {}
        for prefix, suffixes, vf in self._iter_keys_vf(keys):
            vf_sha1s = vf.get_sha1s(suffixes)
            for suffix, sha1 in vf_sha1s.items():
                sha1s[prefix + (suffix,)] = sha1
        return sha1s

    def insert_record_stream(self, stream):
        """Insert a record stream into this container.

        :param stream: A stream of records to insert.
        :return: None
        :seealso VersionedFile.get_record_stream:
        """
        for record in stream:
            prefix = record.key[:-1]
            key = record.key[-1:]
            if record.parents is not None:
                parents = [parent[-1:] for parent in record.parents]
            else:
                parents = None
            thunk_record = AdapterFactory(key, parents, record)
            path = self._mapper.map(prefix)
            # Note that this parses the file many times; we can do better but
            # as this only impacts weaves in terms of performance, it is
            # tolerable.
            vf = self._get_vf(path)
            vf.insert_record_stream([thunk_record])

    def iter_lines_added_or_present_in_keys(self, keys, pb=None):
        r"""Iterate over the lines in the versioned files from keys.

        This may return lines from other keys. Each item the returned
        iterator yields is a tuple of a line and a text version that that line
        is present in (not introduced in).

        Ordering of results is in whatever order is most suitable for the
        underlying storage format.

        If a progress bar is supplied, it may be used to indicate progress.
        The caller is responsible for cleaning up progress bars (because this
        is an iterator).
        Notes:
         * Lines are normalised by the underlying store: they will all have \n
           terminators.
         * Lines are returned in arbitrary order.

        :return: An iterator over (line, key).
        """
        for prefix, suffixes, vf in self._iter_keys_vf(keys):
            for line, version in vf.iter_lines_added_or_present_in_versions(suffixes):
                yield line, prefix + (version,)

    def _iter_all_components(self):
        for path, prefix in self._iter_all_prefixes():
            yield prefix, self._get_vf(path)

    def keys(self):
        """See VersionedFiles.keys()."""
        result = set()
        for prefix, vf in self._iter_all_components():
            for suffix in vf.versions():
                result.add(prefix + (suffix,))
        return result


class VersionedFilesWithFallbacks(VersionedFiles):
    """A versioned files implementation that supports fallback sources.

    This class extends VersionedFiles to provide support for fallback
    versioned files that can supply content not present in the primary
    versioned files.
    """

    def without_fallbacks(self):
        """Return a clone of this object without any fallbacks configured."""
        raise NotImplementedError(self.without_fallbacks)

    def add_fallback_versioned_files(self, a_versioned_files):
        """Add a source of texts for texts not present in this knit.

        :param a_versioned_files: A VersionedFiles object.
        """
        raise NotImplementedError(self.add_fallback_versioned_files)

    def get_known_graph_ancestry(self, keys):
        """Get a KnownGraph instance with the ancestry of keys."""
        parent_map, missing_keys = self._index.find_ancestry(keys)
        for fallback in self._transitive_fallbacks():
            if not missing_keys:
                break
            (f_parent_map, f_missing_keys) = fallback._index.find_ancestry(missing_keys)
            parent_map.update(f_parent_map)
            missing_keys = f_missing_keys
        kg = _mod_known_graph.KnownGraph(parent_map)
        return kg


class _PlanMergeVersionedFile(VersionedFiles):
    """A VersionedFile for uncommitted and committed texts.

    It is intended to allow merges to be planned with working tree texts.
    It implements only the small part of the VersionedFiles interface used by
    PlanMerge.
    It falls back to multiple versionedfiles for data not stored in
    _PlanMergeVersionedFile itself.

    :ivar fallback_versionedfiles: A list of VersionedFiles objects that can
        be queried for missing texts.
    """

    def __init__(self, file_id):
        """Create a _PlanMergeVersionedFile.

        :param file_id: Used with _PlanMerge code which is not yet fully
            tuple-keyspace aware.
        """
        self._file_id = file_id
        # fallback locations
        self.fallback_versionedfiles = []
        # Parents for locally held keys.
        self._parents = {}
        # line data for locally held keys.
        self._lines = {}
        # key lookup providers
        self._providers = [_mod_graph.DictParentsProvider(self._parents)]

    def plan_merge(self, ver_a, ver_b, base=None):
        """See VersionedFile.plan_merge."""
        from .merge import _PlanMerge

        if base is None:
            return _PlanMerge(ver_a, ver_b, self, (self._file_id,)).plan_merge()
        old_plan = list(_PlanMerge(ver_a, base, self, (self._file_id,)).plan_merge())
        new_plan = list(_PlanMerge(ver_a, ver_b, self, (self._file_id,)).plan_merge())
        return _PlanMerge._subtract_plans(old_plan, new_plan)

    def plan_lca_merge(self, ver_a, ver_b, base=None):
        from .merge import _PlanLCAMerge

        graph = _mod_graph.Graph(self)
        new_plan = _PlanLCAMerge(
            ver_a, ver_b, self, (self._file_id,), graph
        ).plan_merge()
        if base is None:
            return new_plan
        old_plan = _PlanLCAMerge(
            ver_a, base, self, (self._file_id,), graph
        ).plan_merge()
        return _PlanLCAMerge._subtract_plans(list(old_plan), list(new_plan))

    def add_content(self, factory):
        return self.add_lines(
            factory.key, factory.parents, factory.get_bytes_as("lines")
        )

    def add_lines(self, key, parents, lines):
        """See VersionedFiles.add_lines.

        Lines are added locally, not to fallback versionedfiles. Also, ghosts
        are permitted. Only reserved ids are permitted.
        """
        if not isinstance(key, tuple):
            raise TypeError(key)
        if not revision.is_reserved_id(key[-1]):
            raise ValueError("Only reserved ids may be used")
        if parents is None:
            raise ValueError("Parents may not be None")
        if lines is None:
            raise ValueError("Lines may not be None")
        self._parents[key] = tuple(parents)
        self._lines[key] = lines

    def get_record_stream(self, keys, ordering, include_delta_closure):
        pending = set(keys)
        for key in keys:
            if key in self._lines:
                lines = self._lines[key]
                parents = self._parents[key]
                pending.remove(key)
                yield ChunkedContentFactory(key, parents, None, lines)
        for versionedfile in self.fallback_versionedfiles:
            for record in versionedfile.get_record_stream(pending, "unordered", True):
                if record.storage_kind == "absent":
                    continue
                else:
                    pending.remove(record.key)
                    yield record
            if not pending:
                return
        # report absent entries
        for key in pending:
            yield AbsentContentFactory(key)

    def get_parent_map(self, keys):
        """See VersionedFiles.get_parent_map."""
        # We create a new provider because a fallback may have been added.
        # If we make fallbacks private we can update a stack list and avoid
        # object creation thrashing.
        keys = set(keys)
        result = {}
        if revision.NULL_REVISION in keys:
            keys.remove(revision.NULL_REVISION)
            result[revision.NULL_REVISION] = ()
        self._providers = self._providers[:1] + self.fallback_versionedfiles
        result.update(
            _mod_graph.StackedParentsProvider(self._providers).get_parent_map(keys)
        )
        for key, parents in result.items():
            if parents == ():
                result[key] = (revision.NULL_REVISION,)
        return result


class PlanWeaveMerge(TextMerge):
    """Weave merge that takes a plan as its input.

    This exists so that VersionedFile.plan_merge is implementable.
    Most callers will want to use WeaveMerge instead.
    """

    def __init__(self, plan, a_marker=TextMerge.A_MARKER, b_marker=TextMerge.B_MARKER):
        """Initialize a PlanWeaveMerge.

        Args:
            plan: The merge plan to execute.
            a_marker: Marker for 'A' side conflicts (optional).
            b_marker: Marker for 'B' side conflicts (optional).
""" TextMerge.__init__(self, a_marker, b_marker) self.plan = list(plan) def _merge_struct(self): lines_a = [] lines_b = [] ch_a = ch_b = False def outstanding_struct(): if not lines_a and not lines_b: return elif ch_a and not ch_b: # one-sided change: yield (lines_a,) elif ch_b and not ch_a: yield (lines_b,) elif lines_a == lines_b: yield (lines_a,) else: yield (lines_a, lines_b) # We previously considered either 'unchanged' or 'killed-both' lines # to be possible places to resynchronize. However, assuming agreement # on killed-both lines may be too aggressive. -- mbp 20060324 for state, line in self.plan: if state == "unchanged": # resync and flush queued conflicts changes if any yield from outstanding_struct() lines_a = [] lines_b = [] ch_a = ch_b = False if state == "unchanged": if line: yield ([line],) elif state == "killed-a": ch_a = True lines_b.append(line) elif state == "killed-b": ch_b = True lines_a.append(line) elif state == "new-a": ch_a = True lines_a.append(line) elif state == "new-b": ch_b = True lines_b.append(line) elif state == "conflicted-a": ch_b = ch_a = True lines_a.append(line) elif state == "conflicted-b": ch_b = ch_a = True lines_b.append(line) elif state == "killed-both": # This counts as a change, even though there is no associated # line ch_b = ch_a = True else: if state not in ("irrelevant", "ghost-a", "ghost-b", "killed-base"): raise AssertionError(state) yield from outstanding_struct() def base_from_plan(self): """Construct a BASE file from the plan text.""" base_lines = [] for state, line in self.plan: if state in ("killed-a", "killed-b", "killed-both", "unchanged"): # If unchanged, then this line is straight from base. If a or b # or both killed the line, then it *used* to be in base. 
base_lines.append(line) else: if state not in ( "killed-base", "irrelevant", "ghost-a", "ghost-b", "new-a", "new-b", "conflicted-a", "conflicted-b", ): # killed-base, irrelevant means it doesn't apply # ghost-a/ghost-b are harder to say for sure, but they # aren't in the 'inc_c' which means they aren't in the # shared base of a & b. So we don't include them. And # obviously if the line is newly inserted, it isn't in base # If 'conflicted-a' or b, then it is new vs one base, but # old versus another base. However, if we make it present # in the base, it will be deleted from the target, and it # seems better to get a line doubled in the merge result, # rather than have it deleted entirely. # Example, each node is the 'text' at that point: # MN # / \ # MaN MbN # | X | # MabN MbaN # \ / # ??? # There was a criss-cross conflict merge. Both sides # include the other, but put themselves first. # Weave marks this as a 'clean' merge, picking OTHER over # THIS. (Though the details depend on order inserted into # weave, etc.) # LCA generates a plan: # [('unchanged', M), # ('conflicted-b', b), # ('unchanged', a), # ('conflicted-a', b), # ('unchanged', N)] # If you mark 'conflicted-*' as part of BASE, then a 3-way # merge tool will cleanly generate "MaN" (as BASE vs THIS # removes one 'b', and BASE vs OTHER removes the other) # If you include neither, 3-way creates a clean "MbabN" as # THIS adds one 'b', and OTHER does too. # It seems that having the line 2 times is better than # having it omitted. (Easier to manually delete than notice # it needs to be added.) raise AssertionError(f"Unknown state: {state}") return base_lines class WeaveMerge(PlanWeaveMerge): """Weave merge that takes a VersionedFile and two versions as its input.""" def __init__( self, versionedfile, ver_a, ver_b, a_marker=PlanWeaveMerge.A_MARKER, b_marker=PlanWeaveMerge.B_MARKER, ): """Initialize a WeaveMerge. Args: versionedfile: The versioned file containing the versions to merge. 
ver_a: First version ID to merge. ver_b: Second version ID to merge. a_marker: Marker for 'A' side conflicts (optional). b_marker: Marker for 'B' side conflicts (optional). """ plan = versionedfile.plan_merge(ver_a, ver_b) PlanWeaveMerge.__init__(self, plan, a_marker, b_marker) class VirtualVersionedFiles(VersionedFiles): """Dummy implementation for VersionedFiles that uses other functions for obtaining fulltexts and parent maps. This is always on the bottom of the stack and uses string keys (rather than tuples) internally. """ def __init__(self, get_parent_map, get_lines): """Create a VirtualVersionedFiles. :param get_parent_map: Same signature as Repository.get_parent_map. :param get_lines: Should return lines for specified key or None if not available. """ super().__init__() self._get_parent_map = get_parent_map self._get_lines = get_lines def check(self, progressbar=None): """See VersionedFiles.check. :note: Always returns True for VirtualVersionedFiles. """ return True def add_mpdiffs(self, records): """See VersionedFiles.add_mpdiffs. :note: Not implemented for VirtualVersionedFiles.
""" raise NotImplementedError(self.add_mpdiffs) def get_parent_map(self, keys): """See VersionedFiles.get_parent_map.""" parent_view = self._get_parent_map(k for (k,) in keys).items() return {(k,): tuple((p,) for p in v) for k, v in parent_view} def get_sha1s(self, keys): """See VersionedFiles.get_sha1s.""" ret = {} for (k,) in keys: lines = self._get_lines(k) if lines is not None: if not isinstance(lines, list): raise AssertionError ret[(k,)] = osutils.sha_strings(lines) return ret def get_record_stream(self, keys, ordering, include_delta_closure): """See VersionedFiles.get_record_stream.""" for (k,) in list(keys): lines = self._get_lines(k) if lines is not None: if not isinstance(lines, list): raise AssertionError yield ChunkedContentFactory( (k,), None, sha1=osutils.sha_strings(lines), chunks=lines, ) else: yield AbsentContentFactory((k,)) def iter_lines_added_or_present_in_keys(self, keys, pb=None): """See VersionedFile.iter_lines_added_or_present_in_versions().""" for i, (key,) in enumerate(keys): if pb is not None: pb.update("Finding changed lines", i, len(keys)) for l in self._get_lines(key): yield (l, key) class NoDupeAddLinesDecorator: """Decorator for a VersionedFiles that skips doing an add_lines if the key is already present. """ def __init__(self, store): """Initialize a NoDupeAddLinesDecorator. Args: store: The underlying versioned files store to decorate. """ self._store = store def add_lines( self, key, parents, lines, parent_texts=None, left_matching_blocks=None, nostore_sha=None, random_id=False, check_content=True, ): """See VersionedFiles.add_lines. This implementation may return None as the third element of the return value when the original store wouldn't. """ if nostore_sha: raise NotImplementedError( "NoDupeAddLinesDecorator.add_lines does not implement the " "nostore_sha behaviour." 
) if key[-1] is None: sha1 = osutils.sha_strings(lines) key = (b"sha1:" + sha1,) else: sha1 = None if key in self._store.get_parent_map([key]): # This key has already been inserted, so don't do it again. if sha1 is None: sha1 = osutils.sha_strings(lines) return sha1, sum(map(len, lines)), None return self._store.add_lines( key, parents, lines, parent_texts=parent_texts, left_matching_blocks=left_matching_blocks, nostore_sha=nostore_sha, random_id=random_id, check_content=check_content, ) def __getattr__(self, name): """Delegate attribute access to the underlying store. Args: name: Name of the attribute to access. Returns: The attribute value from the underlying store. """ return getattr(self._store, name) def network_bytes_to_kind_and_offset(network_bytes): """Strip off a record kind from the front of network_bytes. :param network_bytes: The bytes of a record. :return: A tuple (storage_kind, offset_of_remaining_bytes) """ line_end = network_bytes.find(b"\n") storage_kind = network_bytes[:line_end].decode("ascii") return storage_kind, line_end + 1 class NetworkRecordStream: """A record_stream which reconstitutes a serialised stream.""" def __init__(self, bytes_iterator): """Create a NetworkRecordStream. :param bytes_iterator: An iterator of bytes. Each item in this iterator should have been obtained from a record stream's record.get_bytes_as(record.storage_kind) call. """ from . import groupcompress, knit self._bytes_iterator = bytes_iterator self._kind_factory = { "fulltext": fulltext_network_to_record, "groupcompress-block": groupcompress.network_block_to_records, "knit-ft-gz": knit.knit_network_to_record, "knit-delta-gz": knit.knit_network_to_record, "knit-annotated-ft-gz": knit.knit_network_to_record, "knit-annotated-delta-gz": knit.knit_network_to_record, "knit-delta-closure": knit.knit_delta_closure_to_records, } def read(self): """Read the stream. :return: An iterator as per VersionedFiles.get_record_stream().
""" for bytes in self._bytes_iterator: storage_kind, line_end = network_bytes_to_kind_and_offset(bytes) yield from self._kind_factory[storage_kind](storage_kind, bytes, line_end) def sort_groupcompress(parent_map): """Sort and group the keys in parent_map into groupcompress order. groupcompress is defined (currently) as reverse-topological order, grouped by the key prefix. :return: A sorted-list of keys """ from vcsgraph.tsort import topo_sort # gc-optimal ordering is approximately reverse topological, # properly grouped by file-id. per_prefix_map = {} for item in parent_map.items(): key = item[0] prefix = b"" if isinstance(key, bytes) or len(key) == 1 else key[0] try: per_prefix_map[prefix].append(item) except KeyError: per_prefix_map[prefix] = [item] present_keys = [] for prefix in sorted(per_prefix_map): present_keys.extend(reversed(topo_sort(per_prefix_map[prefix]))) return present_keys class _KeyRefs: def __init__(self, track_new_keys=False): # dict mapping 'key' to 'set of keys referring to that key' self.refs = {} if track_new_keys: # set remembering all new keys self.new_keys = set() else: self.new_keys = None def clear(self): if self.refs: self.refs.clear() if self.new_keys: self.new_keys.clear() def add_references(self, key, refs): # Record the new references for referenced in refs: try: needed_by = self.refs[referenced] except KeyError: needed_by = self.refs[referenced] = set() needed_by.add(key) # Discard references satisfied by the new key self.add_key(key) def get_new_keys(self): return self.new_keys def get_unsatisfied_refs(self): return self.refs.keys() def _satisfy_refs_for_key(self, key): try: del self.refs[key] except KeyError: # No keys depended on this key. That's ok. pass def add_key(self, key): # satisfy refs for key, and remember that we've seen this key. 
self._satisfy_refs_for_key(key) if self.new_keys is not None: self.new_keys.add(key) def satisfy_refs_for_keys(self, keys): for key in keys: self._satisfy_refs_for_key(key) def get_referrers(self): return set(itertools.chain.from_iterable(self.refs.values())) bzrformats_3.4.0.orig/bzrformats/weave.py0000755000000000000000000012506015162115103015532 0ustar00# Copyright (C) 2005, 2009 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # Author: Martin Pool """Weave - storage of related text file versions.""" # XXX: If we do weaves this way, will a merge still behave the same # way if it's done in a different order? That's a pretty desirable # property. # TODO: Nothing here so far assumes the lines are really \n newlines, # rather than being split up in some other way. We could accommodate # binaries, perhaps by naively splitting on \n or perhaps using # something like a rolling checksum. # TODO: End marker for each version so we can stop reading? # TODO: Check that no insertion occurs inside a deletion that was # active in the version of the insertion. # TODO: In addition to the SHA-1 check, perhaps have some code that # checks structural constraints of the weave: ie that insertions are # properly nested, that there is no text outside of an insertion, that # insertions or deletions are not repeated, etc. 
# TODO: Parallel-extract that passes back each line along with a # description of which revisions include it. Nice for checking all # shas or calculating stats in parallel. # TODO: Using a single _extract routine and then processing the output # is probably inefficient. It's simple enough that we can afford to # have slight specializations for different ways it's used: annotate, # basis for add, get, etc. # TODO: Probably the API should work only in names to hide the integer # indexes from the user. # TODO: Is there any potential performance win by having an add() # variant that is passed a pre-cooked version of the single basis # version? # TODO: Reweave can possibly be made faster by remembering diffs # where the basis and destination are unchanged. # FIXME: Sometimes we will be given a parents list for a revision # that includes some redundant parents (i.e. already a parent of # something in the list.) We should eliminate them. This can # be done fairly efficiently because the sequence numbers constrain # the possible relationships. # FIXME: the conflict markers should be *7* characters import contextlib import hashlib import logging import os from copy import copy from io import BytesIO import patiencediff from .errors import ( BzrFormatsError, OutSideTransaction, ReadOnlyObjectDirtiedError, RevisionAlreadyPresent, RevisionNotPresent, ) from .osutils import sha_strings from .revision import NULL_REVISION from .transport import TransportNoSuchFile from .versionedfile import ( AbsentContentFactory, ContentFactory, ExistingContent, UnavailableRepresentation, VersionedFile, adapter_registry, sort_groupcompress, ) from .weavefile import _read_weave_v5, write_weave_v5 logger = logging.getLogger("bzrformats.weave") class WeaveError(BzrFormatsError): """Base class for weave-related errors.""" _fmt = "Error in processing weave: %(msg)s" def __init__(self, msg=None): """Initialize WeaveError with optional message. Args: msg: Optional error message.
""" super().__init__() self.msg = msg class WeaveRevisionAlreadyPresent(WeaveError): """Error raised when attempting to add a revision that already exists.""" _fmt = "Revision {%(revision_id)s} already present in %(weave)s" def __init__(self, revision_id, weave): """Initialize WeaveRevisionAlreadyPresent error. Args: revision_id: The revision ID that already exists. weave: The weave object. """ super().__init__() self.revision_id = revision_id self.weave = weave class WeaveRevisionNotPresent(WeaveError): """Error raised when requesting a revision that doesn't exist.""" _fmt = "Revision {%(revision_id)s} not present in %(weave)s" def __init__(self, revision_id, weave): """Initialize WeaveRevisionNotPresent error. Args: revision_id: The revision ID that was not found. weave: The weave object. """ super().__init__() self.revision_id = revision_id self.weave = weave class WeaveFormatError(WeaveError): """Error raised when weave format is invalid or invariants are violated.""" _fmt = "Weave invariant violated: %(what)s" def __init__(self, what): """Initialize WeaveFormatError. Args: what: Description of the format error or invariant violation. """ super().__init__() self.what = what class WeaveParentMismatch(WeaveError): """Error raised when parent information doesn't match between revisions.""" _fmt = "Parents are mismatched between two revisions. %(msg)s" class WeaveInvalidChecksum(WeaveError): """Error raised when text content doesn't match its expected checksum.""" _fmt = "Text did not match its checksum: %(msg)s" class WeaveTextDiffers(WeaveError): """Error raised when two weaves have different text content for the same revision.""" _fmt = ( "Weaves differ on text content. Revision:" " {%(revision_id)s}, %(weave_a)s, %(weave_b)s" ) def __init__(self, revision_id, weave_a, weave_b): """Initialize WeaveTextDiffers error. Args: revision_id: The revision ID where text differs. weave_a: First weave with differing text. weave_b: Second weave with differing text. 
""" super().__init__() self.revision_id = revision_id self.weave_a = weave_a self.weave_b = weave_b class WeaveContentFactory(ContentFactory): """Content factory for streaming from weaves. :seealso ContentFactory: """ def __init__(self, version, weave): """Create a WeaveContentFactory for version from weave.""" ContentFactory.__init__(self) self.sha1 = weave.get_sha1s([version])[version] self.key = (version,) parents = weave.get_parent_map([version])[version] self.parents = tuple((parent,) for parent in parents) self.storage_kind = "fulltext" self._weave = weave def get_bytes_as(self, storage_kind): """Get content bytes in the specified storage format. Args: storage_kind: The format to return content in ('fulltext', 'chunked', or 'lines'). Returns: Content in the requested format. Raises: UnavailableRepresentation: If the storage_kind is not supported. """ if storage_kind == "fulltext": return self._weave.get_text(self.key[-1]) elif storage_kind in ("chunked", "lines"): return self._weave.get_lines(self.key[-1]) else: raise UnavailableRepresentation(self.key, storage_kind, "fulltext") def iter_bytes_as(self, storage_kind): """Iterate over content bytes in the specified storage format. Args: storage_kind: The format to iterate content in ('chunked' or 'lines'). Returns: Iterator over content lines. Raises: UnavailableRepresentation: If the storage_kind is not supported. """ if storage_kind in ("chunked", "lines"): return iter(self._weave.get_lines(self.key[-1])) else: raise UnavailableRepresentation(self.key, storage_kind, "fulltext") class Weave(VersionedFile): """weave - versioned text file storage. A Weave manages versions of line-based text files, keeping track of the originating version for each line. To clients the "lines" of the file are represented as a list of strings. These strings will typically have terminal newline characters, but this is not required. In particular files commonly do not have a newline at the end of the file. 
Texts can be identified in either of two ways: * a nonnegative index number. * a version-id string. Typically the index number will be valid only inside this weave and the version-id is used to reference it in the larger world. The weave is represented as a list mixing edit instructions and literal text. Each entry in _weave can be either a string (or unicode), or a tuple. If a string, it means that the given line should be output in the currently active revisions. If a tuple, it gives a processing instruction saying in which revisions the enclosed lines are active. The tuple has the form (instruction, version). The instruction can be '{' or '}' for an insertion block, and '[' and ']' for a deletion block respectively. The version is the integer version index. There is no replace operator, only deletes and inserts. For '}', the end of an insertion, there is no version parameter because it always closes the most recently opened insertion. Constraints/notes: * A later version can delete lines that were introduced by any number of ancestor versions; this implies that deletion instructions can span insertion blocks without regard to the insertion block's nesting. * Similarly, deletions need not be properly nested with regard to each other, because they might have been generated by independent revisions. * Insertions are always made by inserting a new bracketed block into a single point in the previous weave. This implies they can nest but not overlap, and the nesting must always have later insertions on the inside. * It doesn't seem very useful to have an active insertion inside an inactive insertion, but it might happen. * Therefore, all instructions are always "considered"; that is, passed onto and off the stack. An outer inactive block doesn't disable an inner block. * Lines are enabled if the most recent enclosing insertion is active and none of the enclosing deletions are active.
* There is no point having a deletion directly inside its own insertion; you might as well just not write it. And there should be no way to get an earlier version deleting a later version. _weave Text of the weave; list of control instruction tuples and strings. _parents List of parents, indexed by version number. It is only necessary to store the minimal set of parents for each version; the parent's parents are implied. _sha1s List of hex SHA-1 of each version. _names List of symbolic names for each version. Each should be unique. _name_map For each name, the version number. _weave_name Descriptive name of this weave; typically the filename if known. Set by read_weave. """ __slots__ = [ "_allow_reserved", "_matcher", "_name_map", "_names", "_parents", "_sha1s", "_weave", "_weave_name", ] def __init__( self, weave_name=None, access_mode="w", matcher=None, get_scope=None, allow_reserved=False, ): """Create a weave. :param get_scope: A callable that returns an opaque object to be used for detecting when this weave goes out of scope (should stop answering requests or allowing mutation). """ super().__init__() self._weave = [] self._parents = [] self._sha1s = [] self._names = [] self._name_map = {} self._weave_name = weave_name if matcher is None: self._matcher = patiencediff.PatienceSequenceMatcher else: self._matcher = matcher if get_scope is None: def get_scope(): return None self._get_scope = get_scope self._scope = get_scope() self._access_mode = access_mode self._allow_reserved = allow_reserved def __repr__(self): """Return string representation of this weave.""" return f"Weave({self._weave_name!r})" def _check_write_ok(self): """Is the versioned file marked as 'finished' ? Raise if it is.""" if self._get_scope() != self._scope: raise OutSideTransaction() if self._access_mode != "w": raise ReadOnlyObjectDirtiedError(self) def copy(self): """Return a deep copy of self. The copy can be modified without affecting the original weave. 
""" other = Weave() other._weave = self._weave[:] other._parents = self._parents[:] other._sha1s = self._sha1s[:] other._names = self._names[:] other._name_map = self._name_map.copy() other._weave_name = self._weave_name return other def __eq__(self, other): """Check if two weaves are equal. Args: other: Another object to compare with. Returns: True if weaves are equal, False otherwise. """ if not isinstance(other, Weave): return False return ( self._parents == other._parents and self._weave == other._weave and self._sha1s == other._sha1s ) def __ne__(self, other): """Check if two weaves are not equal. Args: other: Another object to compare with. Returns: True if weaves are not equal, False otherwise. """ return not self.__eq__(other) def _idx_to_name(self, version): return self._names[version] def _lookup(self, name): """Convert symbolic version name to index.""" if not self._allow_reserved: self.check_not_reserved_id(name) try: return self._name_map[name] except KeyError as e: raise RevisionNotPresent(name, self._weave_name) from e def versions(self): """See VersionedFile.versions.""" return self._names[:] def has_version(self, version_id): """See VersionedFile.has_version.""" return version_id in self._name_map __contains__ = has_version def get_record_stream(self, versions, ordering, include_delta_closure): """Get a stream of records for versions. :param versions: The versions to include. Each version is a tuple (version,). :param ordering: Either 'unordered' or 'topological'. A topologically sorted stream has compression parents strictly before their children. :param include_delta_closure: If True then the closure across any compression parents will be included (in the opaque data). :return: An iterator of ContentFactory objects, each of which is only valid until the iterator is advanced. 
""" import vcsgraph.tsort as tsort versions = [version[-1] for version in versions] if ordering == "topological": parents = self.get_parent_map(versions) new_versions = tsort.topo_sort(parents) new_versions.extend(set(versions).difference(set(parents))) versions = new_versions elif ordering == "groupcompress": parents = self.get_parent_map(versions) new_versions = sort_groupcompress(parents) new_versions.extend(set(versions).difference(set(parents))) versions = new_versions for version in versions: if version in self: yield WeaveContentFactory(version, self) else: yield AbsentContentFactory((version,)) def get_parent_map(self, version_ids): """See VersionedFile.get_parent_map.""" result = {} for version_id in version_ids: if version_id == NULL_REVISION: parents = () else: try: parents = tuple( map(self._idx_to_name, self._parents[self._lookup(version_id)]) ) except RevisionNotPresent: continue result[version_id] = parents return result def get_parents_with_ghosts(self, version_id): """Get parents including ghost revisions (not implemented for weaves). Args: version_id: The version to get parents for. Raises: NotImplementedError: Weaves don't support ghost revisions. """ raise NotImplementedError(self.get_parents_with_ghosts) def insert_record_stream(self, stream): """Insert a record stream into this versioned file. :param stream: A stream of records to insert. :return: None :seealso VersionedFile.get_record_stream: """ adapters = {} for record in stream: # Raise an error when a record is missing. 
if record.storage_kind == "absent": raise RevisionNotPresent([record.key[0]], self) # adapt to non-tuple interface parents = [parent[0] for parent in record.parents] if record.storage_kind in ("fulltext", "chunked", "lines"): self.add_lines(record.key[0], parents, record.get_bytes_as("lines")) else: adapter_key = record.storage_kind, "lines" try: adapter = adapters[adapter_key] except KeyError: adapter_factory = adapter_registry.get(adapter_key) adapter = adapter_factory(self) adapters[adapter_key] = adapter lines = adapter.get_bytes(record, "lines") with contextlib.suppress(RevisionAlreadyPresent): self.add_lines(record.key[0], parents, lines) def _check_repeated_add(self, name, parents, text, sha1): """Check that a duplicated add is OK. If it is, return the (old) index; otherwise raise an exception. """ idx = self._lookup(name) if sorted(self._parents[idx]) != sorted(parents) or sha1 != self._sha1s[idx]: raise RevisionAlreadyPresent(name, self._weave_name) return idx def _add_lines( self, version_id, parents, lines, parent_texts, left_matching_blocks, nostore_sha, random_id, check_content, ): """See VersionedFile.add_lines.""" idx = self._add( version_id, lines, list(map(self._lookup, parents)), nostore_sha=nostore_sha ) return sha_strings(lines), sum(map(len, lines)), idx def _add(self, version_id, lines, parents, sha1=None, nostore_sha=None): """Add a single text on top of the weave. Returns the index number of the newly added version. version_id Symbolic name for this version. (Typically the revision-id of the revision that added it.) If None, a name will be allocated based on the hash. (sha1:SHAHASH) parents List or set of direct parent version numbers. lines Sequence of lines to be added in the new version. :param nostore_sha: See VersionedFile.add_lines. 
""" self._check_lines_not_unicode(lines) self._check_lines_are_lines(lines) if not sha1: sha1 = sha_strings(lines) if sha1 == nostore_sha: raise ExistingContent if version_id is None: version_id = b"sha1:" + sha1 if version_id in self._name_map: return self._check_repeated_add(version_id, parents, lines, sha1) self._check_versions(parents) new_version = len(self._parents) # if we abort after here the (in-memory) weave will be corrupt because # only some fields are updated # XXX: FIXME implement a succeed-or-fail of the rest of this routine. # - Robert Collins 20060226 self._parents.append(parents[:]) self._sha1s.append(sha1) self._names.append(version_id) self._name_map[version_id] = new_version if not parents: # special case; adding with no parents revision; can do # this more quickly by just appending unconditionally. # even more specially, if we're adding an empty text we # need do nothing at all. if lines: self._weave.append((b"{", new_version)) self._weave.extend(lines) self._weave.append((b"}", None)) return new_version if len(parents) == 1: pv = list(parents)[0] if sha1 == self._sha1s[pv]: # special case: same as the single parent return new_version ancestors = self._inclusions(parents) # basis a list of (origin, lineno, line) basis_lineno = [] basis_lines = [] for _origin, lineno, line in self._extract(ancestors): basis_lineno.append(lineno) basis_lines.append(line) # another small special case: a merge, producing the same text # as auto-merge if lines == basis_lines: return new_version # add a sentinel, because we can also match against the final line basis_lineno.append(len(self._weave)) # XXX: which line of the weave should we really consider # matches the end of the file? the current code says it's the # last line of the weave? 
# print 'basis_lines:', basis_lines # print 'new_lines: ', lines s = self._matcher(None, basis_lines, lines) # offset gives the number of lines that have been inserted # into the weave up to the current point; if the original edit # instruction says to change line A then we actually change (A+offset) offset = 0 for tag, i1, i2, j1, j2 in s.get_opcodes(): # i1,i2 are given in offsets within basis_lines; we need to map # them back to offsets within the entire weave print 'raw match', # tag, i1, i2, j1, j2 if tag == "equal": continue i1 = basis_lineno[i1] i2 = basis_lineno[i2] # the deletion and insertion are handled separately. # first delete the region. if i1 != i2: self._weave.insert(i1 + offset, (b"[", new_version)) self._weave.insert(i2 + offset + 1, (b"]", new_version)) offset += 2 if j1 != j2: # there may have been a deletion spanning up to # i2; we want to insert after this region to make sure # we don't destroy ourselves i = i2 + offset self._weave[i:i] = [(b"{", new_version)] + lines[j1:j2] + [(b"}", None)] offset += 2 + (j2 - j1) return new_version def _inclusions(self, versions): """Return set of all ancestors of given version(s).""" if not len(versions): return set() i = set(versions) for v in range(max(versions), 0, -1): if v in i: # include all its parents i.update(self._parents[v]) return i def get_ancestry(self, version_ids, topo_sorted=True): """See VersionedFile.get_ancestry.""" if isinstance(version_ids, bytes): version_ids = [version_ids] i = self._inclusions([self._lookup(v) for v in version_ids]) return {self._idx_to_name(v) for v in i} def _check_versions(self, indexes): """Check everything in the sequence of indexes is valid.""" for i in indexes: try: self._parents[i] except IndexError as err: raise IndexError(f"invalid version number {i!r}") from err def _compatible_parents(self, my_parents, other_parents): """During join check that other_parents are joinable with my_parents. 
Joinable is defined as 'is a subset of' - supersets may require regeneration of diffs, but subsets do not. """ return len(other_parents.difference(my_parents)) == 0 def annotate(self, version_id): """Return a list of (version-id, line) tuples for version_id. The index indicates when the line originated in the weave. """ incls = [self._lookup(version_id)] return [ (self._idx_to_name(origin), text) for origin, lineno, text in self._extract(incls) ] def iter_lines_added_or_present_in_versions(self, version_ids=None, pb=None): """See VersionedFile.iter_lines_added_or_present_in_versions().""" if version_ids is None: version_ids = self.versions() version_ids = set(version_ids) for _lineno, inserted, _deletes, line in self._walk_internal(version_ids): if inserted not in version_ids: continue if not line.endswith(b"\n"): yield line + b"\n", inserted else: yield line, inserted def _walk_internal(self, version_ids=None): """Helper method for weave actions.""" istack = [] dset = set() for lineno, l in enumerate(self._weave): if isinstance(l, tuple): c, v = l if c == b"{": istack.append(self._names[v]) elif c == b"}": istack.pop() elif c == b"[": dset.add(self._names[v]) elif c == b"]": dset.remove(self._names[v]) else: raise WeaveFormatError(f"unexpected instruction {v!r}") else: yield lineno, istack[-1], frozenset(dset), l if istack: raise WeaveFormatError( "unclosed insertion blocks at end of weave: {}".format(istack) ) if dset: raise WeaveFormatError(f"unclosed deletion blocks at end of weave: {dset}") def plan_merge(self, ver_a, ver_b): """Return pseudo-annotation indicating how the two versions merge. This is computed between versions a and b and their common base. Weave lines present in none of them are skipped entirely. 
""" inc_a = self.get_ancestry([ver_a]) inc_b = self.get_ancestry([ver_b]) inc_c = inc_a & inc_b for _lineno, insert, deleteset, line in self._walk_internal([ver_a, ver_b]): if deleteset & inc_c: # killed in parent; can't be in either a or b # not relevant to our work yield "killed-base", line elif insert in inc_c: # was inserted in base killed_a = bool(deleteset & inc_a) killed_b = bool(deleteset & inc_b) if killed_a and killed_b: yield "killed-both", line elif killed_a: yield "killed-a", line elif killed_b: yield "killed-b", line else: yield "unchanged", line elif insert in inc_a: if deleteset & inc_a: yield "ghost-a", line else: # new in A; not in B yield "new-a", line elif insert in inc_b: if deleteset & inc_b: yield "ghost-b", line else: yield "new-b", line else: # not in either revision yield "irrelevant", line def _extract(self, versions): """Yield annotation of lines in included set. Yields a sequence of tuples (origin, lineno, text), where origin is the origin version, lineno the index in the weave, and text the text of the line. The set typically but not necessarily corresponds to a version. """ for i in versions: if not isinstance(i, int): raise ValueError(i) included = self._inclusions(versions) istack = [] iset = set() dset = set() isactive = None result = [] # wow. # 449 0 4474.6820 2356.5590 bzrformats.weave:556(_extract) # +285282 0 1676.8040 1676.8040 + # 1.6 seconds in 'isinstance'. # changing the first isinstance: # 449 0 2814.2660 1577.1760 bzrformats.weave:556(_extract) # +140414 0 762.8050 762.8050 + # note that the inline time actually dropped (less function calls) # and total processing time was halved. # we're still spending ~1/4 of the method in isinstance though. # so lets hard code the acceptable string classes we expect: # 449 0 1202.9420 786.2930 bzrformats.weave:556(_extract) # +71352 0 377.5560 377.5560 + # yay, down to ~1/4 the initial extract time, and our inline time # has shrunk again, with isinstance no longer dominating. 
# tweaking the stack inclusion test to use a set gives: # 449 0 1122.8030 713.0080 bzrformats.weave:556(_extract) # +71352 0 354.9980 354.9980 + # - a 5% win, or possibly just noise. However with large istacks that # 'in' test could dominate, so I'm leaving this change in place - when # its fast enough to consider profiling big datasets we can review. for lineno, l in enumerate(self._weave): if isinstance(l, tuple): c, v = l isactive = None if c == b"{": istack.append(v) iset.add(v) elif c == b"}": iset.remove(istack.pop()) elif c == b"[": if v in included: dset.add(v) elif c == b"]": if v in included: dset.remove(v) else: raise AssertionError() else: if isactive is None: isactive = (not dset) and istack and (istack[-1] in included) if isactive: result.append((istack[-1], lineno, l)) if istack: raise WeaveFormatError( "unclosed insertion blocks at end of weave: {}".format(istack) ) if dset: raise WeaveFormatError(f"unclosed deletion blocks at end of weave: {dset}") return result def _maybe_lookup(self, name_or_index): """Convert possible symbolic name to index, or pass through indexes. NOT FOR PUBLIC USE. 
""" # GZ 2017-04-01: This used to check for long as well, but I don't think # there are python implementations with sys.maxsize > sys.maxint if isinstance(name_or_index, int): return name_or_index else: return self._lookup(name_or_index) def get_lines(self, version_id): """See VersionedFile.get_lines().""" int_index = self._maybe_lookup(version_id) result = [line for (origin, lineno, line) in self._extract([int_index])] expected_sha1 = self._sha1s[int_index] measured_sha1 = sha_strings(result) if measured_sha1 != expected_sha1: raise WeaveInvalidChecksum( "file {}, revision {}, expected: {}, measured {}".format( self._weave_name, version_id, expected_sha1, measured_sha1 ) ) return result def get_sha1s(self, version_ids): """See VersionedFile.get_sha1s().""" result = {} for v in version_ids: result[v] = self._sha1s[self._lookup(v)] return result def num_versions(self): """How many versions are in this weave?""" return len(self._parents) __len__ = num_versions def check(self, progress_bar=None): """Check the internal consistency of this weave. Args: progress_bar: Optional progress bar for long-running checks. Raises: WeaveFormatError: If format violations are found. WeaveInvalidChecksum: If text doesn't match expected checksums. """ # TODO evaluate performance hit of using string sets in this routine. 
# TODO: check no circular inclusions # TODO: create a nested progress bar for version in range(self.num_versions()): inclusions = list(self._parents[version]) if inclusions: inclusions.sort() if inclusions[-1] >= version: raise WeaveFormatError( "invalid included version %d for index %d" % (inclusions[-1], version) ) # try extracting all versions; parallel extraction is used nv = self.num_versions() sha1s = {} texts = {} inclusions = {} for i in range(nv): # For creating the ancestry, IntSet is much faster (3.7s vs 0.17s) # The problem is that set membership is much more expensive name = self._idx_to_name(i) sha1s[name] = hashlib.sha1() # noqa: S324 texts[name] = [] new_inc = {name} for p in self._parents[i]: new_inc.update(inclusions[self._idx_to_name(p)]) if new_inc != self.get_ancestry(name): raise AssertionError(f"failed {new_inc} != {self.get_ancestry(name)}") inclusions[name] = new_inc nlines = len(self._weave) update_text = "checking weave" if self._weave_name: short_name = os.path.basename(self._weave_name) update_text = f"checking {short_name}" update_text = update_text[:25] for lineno, insert, deleteset, line in self._walk_internal(): if progress_bar: progress_bar.update(update_text, lineno, nlines) for name, name_inclusions in inclusions.items(): # The active inclusion must be an ancestor, # and no ancestors must have deleted this line, # because we don't support resurrection. if (insert in name_inclusions) and not (deleteset & name_inclusions): sha1s[name].update(line) for i in range(nv): version = self._idx_to_name(i) hd = sha1s[version].hexdigest().encode() expected = self._sha1s[i] if hd != expected: raise WeaveInvalidChecksum( f"mismatched sha1 for version {version}: " f"got {hd}, expected {expected}" ) # TODO: check insertions are properly nested, that there are # no lines outside of insertion blocks, that deletions are # properly paired, etc. 
    def _imported_parents(self, other, other_idx):
        """Return list of parents in self corresponding to indexes in other."""
        new_parents = []
        for parent_idx in other._parents[other_idx]:
            parent_name = other._names[parent_idx]
            if parent_name not in self._name_map:
                # should not be possible
                # _names maps index -> name; _name_map maps name -> index,
                # so the child's name must be looked up in _names.
                raise WeaveError(
                    f"missing parent {{{parent_name}}} of {{{other._names[other_idx]}}} in {self!r}"
                )
            new_parents.append(self._name_map[parent_name])
        return new_parents

    def _check_version_consistent(self, other, other_idx, name):
        """Check if a version is consistent in this and other.

        To be consistent it must have:

         * the same text
         * the same direct parents (by name, not index, and disregarding
           order)

        If present & correct return True;
        if not present in self return False;
        if inconsistent raise error.
        """
        this_idx = self._name_map.get(name, -1)
        if this_idx != -1:
            if self._sha1s[this_idx] != other._sha1s[other_idx]:
                raise WeaveTextDiffers(name, self, other)
            self_parents = self._parents[this_idx]
            other_parents = other._parents[other_idx]
            n1 = {self._names[i] for i in self_parents}
            n2 = {other._names[i] for i in other_parents}
            if not self._compatible_parents(n1, n2):
                raise WeaveParentMismatch(
                    f"inconsistent parents for version {{{name}}}: {n1} vs {n2}"
                )
            else:
                return True  # ok!
        else:
            return False

    def _reweave(self, other, pb, msg):
        """Reweave self with other - internal helper for join().

        :param other: The other weave to merge
        :param pb: An optional progress bar, indicating how far done we are
        :param msg: An optional message for the progress
        """
        new_weave = _reweave(self, other, pb=pb, msg=msg)
        self._copy_weave_content(new_weave)

    def _copy_weave_content(self, otherweave):
        """Absorb the content from otherweave."""
        for attr in self.__slots__:
            if attr != "_weave_name":
                setattr(self, attr, copy(getattr(otherweave, attr)))


class WeaveFile(Weave):
    """A WeaveFile represents a Weave on disk and writes on change."""

    WEAVE_SUFFIX = ".weave"

    def __init__(
        self,
        name,
        transport,
        filemode=None,
        create=False,
        access_mode="w",
        get_scope=None,
    ):
        """Create a WeaveFile.

        :param create: If not True, only open an existing weave.
        """
        super().__init__(name, access_mode, get_scope=get_scope, allow_reserved=False)
        self._transport = transport
        self._filemode = filemode
        try:
            with self._transport.get(name + WeaveFile.WEAVE_SUFFIX) as f:
                _read_weave_v5(BytesIO(f.read()), self)
        except TransportNoSuchFile:
            if not create:
                raise
            # new file, save it
            self._save()

    def _add_lines(
        self,
        version_id,
        parents,
        lines,
        parent_texts,
        left_matching_blocks,
        nostore_sha,
        random_id,
        check_content,
    ):
        """Add a version and save the weave."""
        self.check_not_reserved_id(version_id)
        result = super()._add_lines(
            version_id,
            parents,
            lines,
            parent_texts,
            left_matching_blocks,
            nostore_sha,
            random_id,
            check_content,
        )
        self._save()
        return result

    def copy_to(self, name, transport):
        """See VersionedFile.copy_to()."""
        # as we are all in memory always, just serialise to the new place.
sio = BytesIO() write_weave_v5(self, sio) sio.seek(0) transport.put_file(name + WeaveFile.WEAVE_SUFFIX, sio, self._filemode) def _save(self): """Save the weave.""" self._check_write_ok() sio = BytesIO() write_weave_v5(self, sio) sio.seek(0) bytes = sio.getvalue() path = self._weave_name + WeaveFile.WEAVE_SUFFIX try: self._transport.put_bytes(path, bytes, self._filemode) except TransportNoSuchFile: self._transport.mkdir(os.path.dirname(path)) self._transport.put_bytes(path, bytes, self._filemode) @staticmethod def get_suffixes(): """See VersionedFile.get_suffixes().""" return [WeaveFile.WEAVE_SUFFIX] def insert_record_stream(self, stream): """Insert records from a stream and save the weave file. Args: stream: A stream of records to insert. """ super().insert_record_stream(stream) self._save() def _reweave(wa, wb, pb=None, msg=None): """Combine two weaves and return the result. This works even if a revision R has different parents in wa and wb. In the resulting weave all the parents are given. This is done by just building up a new weave, maintaining ordering of the versions in the two inputs. More efficient approaches might be possible but it should only be necessary to do this operation rarely, when a new previously ghost version is inserted. :param pb: An optional progress bar, indicating how far done we are :param msg: An optional message for the progress """ import vcsgraph.tsort as tsort wr = Weave() # first determine combined parents of all versions # map from version name -> all parent names combined_parents = _reweave_parent_graphs(wa, wb) logger.debug("combined parents: %r", combined_parents) order = tsort.topo_sort(combined_parents.items()) logger.debug("order to reweave: %r", order) if pb and not msg: msg = "reweave" for idx, name in enumerate(order): if pb: pb.update(msg, idx, len(order)) if name in wa._name_map: lines = wa.get_lines(name) if name in wb._name_map: lines_b = wb.get_lines(name) if lines != lines_b: logger.debug("Weaves differ on content. 
rev_id {%s}", name) logger.debug("weaves: %s, %s", wa._weave_name, wb._weave_name) import difflib lines = list( difflib.unified_diff( lines, lines_b, wa._weave_name, wb._weave_name ) ) logger.debug("lines:\n%s", "".join(lines)) raise WeaveTextDiffers(name, wa, wb) else: lines = wb.get_lines(name) wr._add(name, lines, [wr._lookup(i) for i in combined_parents[name]]) return wr def _reweave_parent_graphs(wa, wb): """Return combined parent ancestry for two weaves. Returned as a list of (version_name, set(parent_names)) """ combined = {} for weave in [wa, wb]: for idx, name in enumerate(weave._names): p = combined.setdefault(name, set()) p.update(map(weave._idx_to_name, weave._parents[idx])) return combined bzrformats_3.4.0.orig/bzrformats/weavefile.py0000644000000000000000000001377215162074037016407 0ustar00# Copyright (C) 2005-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # Author: Martin Pool """Store and retrieve weaves in files. There is one format marker followed by a blank line, followed by a series of version headers, followed by the weave itself. Each version marker has 'i' parent version indexes '1' SHA-1 of text 'n' name The inclusions do not need to list versions included by a parent. The weave is bracketed by 'w' and 'W' lines, and includes the '{}[]' processing instructions. 
Lines of text are prefixed by '.' if the line contains a newline, or ',' if not. """ # TODO: When extracting a single version it'd be enough to just pass # an iterator returning the weave lines... We don't really need to # deserialize it into memory. FORMAT_1 = b"# bzr weave file v5\n" def write_weave(weave, f, format=None): """Write a weave to a file. Args: weave: The weave object to write. f: File-like object to write to. format: The weave format version to use. Currently only supports None or 1. Raises: ValueError: If an unknown format is specified. Returns: The result of write_weave_v5 (None). """ if format is None or format == 1: return write_weave_v5(weave, f) else: raise ValueError(f"unknown weave format {format!r}") def write_weave_v5(weave, f): """Write weave to file f.""" f.write(FORMAT_1) for version, included in enumerate(weave._parents): if included: # mininc = weave.minimal_parents(version) mininc = included f.write(b"i ") f.write(b" ".join(b"%d" % i for i in mininc)) f.write(b"\n") else: f.write(b"i\n") f.write(b"1 " + weave._sha1s[version] + b"\n") f.write(b"n " + weave._names[version] + b"\n") f.write(b"\n") f.write(b"w\n") for l in weave._weave: if isinstance(l, tuple): if l[0] == b"}": f.write(b"}\n") else: f.write(l[0] + b" %d\n" % l[1]) else: # text line if not l: f.write(b", \n") elif l.endswith(b"\n"): f.write(b". " + l) else: f.write(b", " + l + b"\n") f.write(b"W\n") def read_weave(f): """Read a weave from a file. Args: f: File-like object to read from. Returns: A Weave object containing the data read from the file. """ # FIXME: detect the weave type and dispatch from .weave import Weave w = Weave(getattr(f, "name", None)) _read_weave_v5(f, w) return w def _read_weave_v5(f, w): """Private helper routine to read a weave format 5 file into memory. This is only to be used by read_weave and WeaveFile.__init__. 
""" # 200 0 2075.5080 1084.0360 bzrformats.weavefile:104(_read_weave_v5) # +60412 0 366.5900 366.5900 + # +59982 0 320.5280 320.5280 + # +59363 0 297.8080 297.8080 + # replace readline call with iter over all lines -> # safe because we already suck on memory. # 200 0 1492.7170 802.6220 bzrformats.weavefile:104(_read_weave_v5) # +59982 0 329.9100 329.9100 + # +59363 0 320.2980 320.2980 + # replaced startswith with slice lookups: # 200 0 851.7250 501.1120 bzrformats.weavefile:104(_read_weave_v5) # +59363 0 311.8780 311.8780 + # +200 0 30.2500 30.2500 + from .weave import WeaveFormatError try: lines = iter(f.readlines()) finally: f.close() try: l = next(lines) except StopIteration as err: raise WeaveFormatError("invalid weave file: no header") from err if l != FORMAT_1: raise WeaveFormatError(f"invalid weave file header: {l!r}") ver = 0 # read weave header. while True: try: l = next(lines) except StopIteration as err: raise WeaveFormatError("unexpected end of file") from err if l[0:1] == b"i": if len(l) > 2: w._parents.append(list(map(int, l[2:].split(b" ")))) else: w._parents.append([]) l = next(lines)[:-1] w._sha1s.append(l[2:]) l = next(lines) name = l[2:-1] w._names.append(name) w._name_map[name] = ver l = next(lines) ver += 1 elif l == b"w\n": break else: raise WeaveFormatError(f"unexpected line {l!r}") # read weave body while True: try: l = next(lines) except StopIteration as err: raise WeaveFormatError("unexpected end of file") from err if l == b"W\n": break elif l[0:2] == b". 
": w._weave.append(l[2:]) # include newline elif l[0:2] == b", ": w._weave.append(l[2:-1]) # exclude newline elif l == b"}\n": w._weave.append((b"}", None)) else: w._weave.append((l[0:1], int(l[2:].decode("ascii")))) return w bzrformats_3.4.0.orig/bzrformats/xml4.py0000644000000000000000000002024315162115103015301 0ustar00# Copyright (C) 2005-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """XML serialization support for weave format version 4.""" from . import inventory from . import revision as _mod_revision from .errors import BzrFormatsError as BzrError from .inventory import ROOT_ID, Inventory from .xml_serializer import ( Element, SubElement, XMLInventorySerializer, XMLRevisionSerializer, escape_invalid_chars, ) class Revision(_mod_revision.Revision): """Revision class with additional v4-specific attributes.""" def __new__(cls, *args, **kwargs): """Create new Revision instance with inventory_id and parent_sha1s. Args: *args: Positional arguments passed to parent class. **kwargs: Keyword arguments, including inventory_id and parent_sha1s. Returns: New Revision instance with additional attributes. 
""" inventory_id = kwargs.pop("inventory_id", None) parent_sha1s = kwargs.pop("parent_sha1s", None) self = _mod_revision.Revision.__new__(cls, *args, **kwargs) self.inventory_id = inventory_id self.parent_sha1s = parent_sha1s return self class _RevisionSerializer_v4(XMLRevisionSerializer): """Version 0.0.4 serializer. You should use the revision_serializer_v4 singleton. v4 serialisation is no longer supported, only deserialisation. """ __slots__: list[str] = [] def _pack_revision(self, rev): """Revision object -> xml tree.""" root = Element( "revision", committer=rev.committer, timestamp=f"{rev.timestamp:.9f}", revision_id=rev.revision_id, inventory_id=rev.inventory_id, inventory_sha1=rev.inventory_sha1, ) if rev.timezone: root.set("timezone", str(rev.timezone)) root.text = "\n" msg = SubElement(root, "message") msg.text = escape_invalid_chars(rev.message)[0] msg.tail = "\n" if rev.parents: pelts = SubElement(root, "parents") pelts.tail = pelts.text = "\n" for i, parent_id in enumerate(rev.parents): p = SubElement(pelts, "revision_ref") p.tail = "\n" p.set("revision_id", parent_id) if i < len(rev.parent_sha1s): p.set("revision_sha1", rev.parent_sha1s[i]) return root def write_revision_to_string(self, rev): return tostring(self._pack_revision(rev)) + b"\n" def _write_element(self, elt, f): ElementTree(elt).write(f, "utf-8") f.write(b"\n") def _unpack_revision(self, elt): """XML Element -> Revision object.""" # is deprecated... 
if elt.tag not in ("revision", "changeset"): raise BzrError(f"unexpected tag in revision file: {elt!r}") v = elt.get("timezone") timezone = v and int(v) message = elt.findtext("message") # text of precursor = elt.get("precursor") precursor_sha1 = elt.get("precursor_sha1") pelts = elt.find("parents") parent_ids = [] parent_sha1s = [] if pelts: for p in pelts: parent_ids.append(p.get("revision_id").encode("utf-8")) parent_sha1s.append( p.get("revision_sha1").encode("utf-8") if p.get("revision_sha1") else None ) if precursor: # must be consistent parent_ids[0] elif precursor: # revisions written prior to 0.0.5 have a single precursor # give as an attribute parent_ids.append(precursor) parent_sha1s.append(precursor_sha1) return Revision( committer=elt.get("committer"), timestamp=float(elt.get("timestamp")), revision_id=elt.get("revision_id").encode("utf-8"), inventory_id=elt.get("inventory_id").encode("utf-8"), inventory_sha1=elt.get("inventory_sha1").encode("utf-8"), timezone=timezone, message=message, parent_ids=parent_ids, parent_sha1s=parent_sha1s, properties={}, ) class _InventorySerializer_v4(XMLInventorySerializer): """Version 0.0.4 serializer. You should use the inventory_serializer_v4 singleton. v4 serialisation is no longer supported, only deserialisation. """ def _pack_entry(self, ie): """Convert InventoryEntry to XML element.""" e = Element("entry") e.set("name", ie.name) e.set("file_id", ie.file_id.decode("ascii")) e.set("kind", ie.kind) if ie.text_size is not None: e.set("text_size", "%d" % ie.text_size) for f in ["text_id", "text_sha1", "symlink_target"]: v = getattr(ie, f) if v is not None: e.set(f, v) # to be conservative, we don't externalize the root pointers # for now, leaving them as null in the xml form. in a future # version it will be implied by nested elements. 
if ie.parent_id != ROOT_ID: e.set("parent_id", ie.parent_id) e.tail = "\n" return e def _unpack_inventory( self, elt, revision_id=None, entry_cache=None, return_from_cache=False ): """Construct from XML Element. :param revision_id: Ignored parameter used by xml5. """ root_id = elt.get("file_id") root_id = root_id.encode("ascii") if root_id else ROOT_ID inv = Inventory(root_id) for e in elt: ie = self._unpack_entry( e, entry_cache=entry_cache, return_from_cache=return_from_cache, root_id=root_id, ) inv.add(ie) return inv def _unpack_entry(self, elt, root_id, entry_cache=None, return_from_cache=False): # original format inventories don't have a parent_id for # nodes in the root directory, but it's cleaner to use one # internally. parent_id = elt.get("parent_id") parent_id = parent_id.encode("ascii") if parent_id else ROOT_ID if parent_id == ROOT_ID: parent_id = root_id file_id = elt.get("file_id").encode("ascii") kind = elt.get("kind") if kind == "directory": ie = inventory.InventoryDirectory(file_id, elt.get("name"), parent_id) elif kind == "file": text_id = elt.get("text_id") if text_id is not None: text_id = text_id.encode("utf-8") text_sha1 = elt.get("text_sha1") if text_sha1 is not None: text_sha1 = text_sha1.encode("ascii") v = elt.get("text_size") text_size = v and int(v) ie = inventory.InventoryFile( file_id, elt.get("name"), parent_id, text_size=text_size, text_sha1=text_sha1, text_id=text_id, ) elif kind == "symlink": ie = inventory.InventoryLink( file_id, elt.get("name"), parent_id, symlink_target=elt.get("symlink_target"), ) else: raise BzrError(f"unknown kind {kind!r}") ## mutter("read inventoryentry: %r", elt.attrib) return ie revision_serializer_v4 = _RevisionSerializer_v4() inventory_serializer_v4 = _InventorySerializer_v4() bzrformats_3.4.0.orig/bzrformats/xml5.py0000644000000000000000000000617715162115103015314 0ustar00# Copyright (C) 2008, 2009, 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under 
the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

"""XML serialization format version 5 for inventories."""

from . import inventory, xml6
from ._bzr_rs import revision_serializer_v5  # noqa: F401
from .errors import BzrFormatsError
from .xml_serializer import encode_and_escape, get_utf8_or_ascii, unpack_inventory_entry


class InventorySerializer_v5(xml6.InventorySerializer_v6):
    """Version 5 serializer.

    Packs objects into XML and vice versa.
    """

    format_num = b"5"
    root_id = inventory.ROOT_ID

    def _unpack_inventory(
        self, elt, revision_id, entry_cache=None, return_from_cache=False
    ):
        """Construct from XML Element."""
        root_id = elt.get("file_id") or inventory.ROOT_ID
        root_id = get_utf8_or_ascii(root_id)

        format = elt.get("format")
        if format is not None and format != "5":
            raise BzrFormatsError(f"invalid format version {format!r} on inventory")
        data_revision_id = elt.get("revision_id")
        if data_revision_id is not None:
            revision_id = data_revision_id.encode("utf-8")
        inv = inventory.Inventory(root_id=None, revision_id=revision_id)
        root = inventory.InventoryDirectory(root_id, "", None, revision=revision_id)
        inv.add(root)
        # Optimizations tested
        #   baseline w/entry cache  2.85s
        #   using inv._byid         2.55s
        #   avoiding attributes     2.46s
        #   adding assertions       2.50s
        #   last_parent cache       2.52s (worse, removed)
        for e in elt:
            ie = unpack_inventory_entry(
                e,
                entry_cache=entry_cache,
                return_from_cache=return_from_cache,
                root_id=root_id,
            )
            inv.add(ie)
        self._check_cache_size(len(inv), entry_cache)
        return inv

    def _append_inventory_root(self, append, inv):
        """Append the inventory root to output."""
        if inv.root.file_id not in (None, inventory.ROOT_ID):
            fileid = b"".join(
                [b' file_id="', encode_and_escape(inv.root.file_id), b'"']
            )
        else:
            fileid = b""
        if inv.revision_id is not None:
            revid = b"".join(
                [b' revision_id="', encode_and_escape(inv.revision_id), b'"']
            )
        else:
            revid = b""
        append(b'<inventory%s format="5"%s>\n' % (fileid, revid))


inventory_serializer_v5 = InventorySerializer_v5()
bzrformats_3.4.0.orig/bzrformats/xml6.py0000644000000000000000000000231215162073400015303 0ustar00
# Copyright (C) 2008 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """XML serialization format version 6 for inventories with rich roots.""" from . import xml8 class InventorySerializer_v6(xml8.InventorySerializer_v8): """This serialiser supports rich roots. While its inventory format number is 6, its revision format is 5. Its inventory_sha1 may be inaccurate-- the inventory may have been converted from format 5 or 7 without updating the sha1. """ format_num = b"6" inventory_serializer_v6 = InventorySerializer_v6() bzrformats_3.4.0.orig/bzrformats/xml7.py0000644000000000000000000000216515162073400015312 0ustar00# Copyright (C) 2006-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """XML serialization format version 7 with tree reference support.""" from . 
import xml6 class InventorySerializer_v7(xml6.InventorySerializer_v6): """A Serializer that supports tree references.""" # this format is used by BzrBranch6 supported_kinds = {"file", "directory", "symlink", "tree-reference"} format_num = b"7" inventory_serializer_v7 = InventorySerializer_v7() bzrformats_3.4.0.orig/bzrformats/xml8.py0000644000000000000000000002564615162115103015321 0ustar00# Copyright (C) 2005-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """XML serialization format version 8. This module provides XML-based inventory serialization for Bazaar format 8. It includes support for rich roots and the altered-by hack for efficient file ID lookups. """ import logging import re from ._bzr_rs import revision_serializer_v8 # noqa: F401 from .xml_serializer import ( XMLInventorySerializer, encode_and_escape, serialize_inventory_flat, unpack_inventory_entry, unpack_inventory_flat, ) logger = logging.getLogger("bzrformats.xml8") _xml_unescape_map = { b"apos": b"'", b"quot": b'"', b"amp": b"&", b"lt": b"<", b"gt": b">", } def _unescaper(match, _map=_xml_unescape_map): """Unescape XML entity references. Args: match: A regex match object containing the entity code. _map: Dictionary mapping entity names to their unescaped values. Returns: bytes: The unescaped character(s). 
    Raises:
        KeyError: If the entity code is not recognized.
    """
    code = match.group(1)
    try:
        return _map[code]
    except KeyError:
        if not code.startswith(b"#"):
            raise
        return chr(int(code[1:])).encode("utf8")


_unescape_re = re.compile(b"\\&([^;]*);")


def _unescape_xml(data):
    """Unescape predefined XML entities in a string of data."""
    return _unescape_re.sub(_unescaper, data)


class InventorySerializer_v8(XMLInventorySerializer):
    """This serialiser adds rich roots.

    Its revision format number matches its inventory number.
    """

    __slots__: list[str] = []

    root_id: bytes | None = None
    support_altered_by_hack = True
    # This format supports the altered-by hack that reads file ids directly out
    # of the versionedfile, without doing XML parsing.

    supported_kinds = {"file", "directory", "symlink"}
    format_num = b"8"
    revision_format_num: bytes | None = None

    # The search regex used by xml based repositories to determine what things
    # were changed in a single commit.
    _file_ids_altered_regex = re.compile(
        b'file_id="(?P<file_id>[^"]+)".* revision="(?P<revision_id>[^"]+)"'
    )

    def _check_revisions(self, inv):
        """Extension point for subclasses to check during serialisation.

        :param inv: An inventory about to be serialised, to be checked.
        :raises: AssertionError if an error has occurred.
        """
        if inv.revision_id is None:
            raise AssertionError("inv.revision_id is None")
        if inv.root.revision is None:
            raise AssertionError("inv.root.revision is None")

    def _check_cache_size(self, inv_size, entry_cache):
        """Check that the entry_cache is large enough.

        We want the cache to be ~2x the size of an inventory. The reason is
        because we use a FIFO cache, and how Inventory records are likely to
        change. In general, you have a small number of records which change
        often, and a lot of records which do not change at all. So when the
        cache gets full, you actually flush out a lot of the records you are
        interested in, which means you need to recreate all of those records.
        An LRU Cache would be better, but the overhead negates the cache
        coherency benefit.

        One way to look at it, only the size of the cache > len(inv) is your
        'working' set. And in general, it shouldn't be a problem to hold 2
        inventories in memory anyway.

        :param inv_size: The number of entries in an inventory.
        """
        if entry_cache is None:
            return
        # 1.5 times might also be reasonable.
        recommended_min_cache_size = inv_size * 1.5
        if entry_cache.cache_size() < recommended_min_cache_size:
            recommended_cache_size = inv_size * 2
            logger.debug(
                "Resizing the inventory entry cache from %d to %d",
                entry_cache.cache_size(),
                recommended_cache_size,
            )
            entry_cache.resize(recommended_cache_size)

    def write_inventory_to_lines(self, inv):
        """Return a list of lines with the encoded inventory."""
        return self.write_inventory(inv, None)

    def write_inventory_to_chunks(self, inv):
        """Write inventory to chunks.

        Args:
            inv: The inventory to serialize.

        Returns:
            list: The inventory serialized as a list of byte chunks.
        """
        return self.write_inventory(inv, None)

    def write_inventory(self, inv, f, working=False):
        """Write inventory to a file.

        :param inv: the inventory to write.
        :param f: the file to write. (May be None if the lines are the desired
            output).
        :param working: If True skip history data - text_sha1, text_size,
            reference_revision, symlink_target.
        :return: The inventory as a list of lines.
        """
        output = []
        append = output.append
        self._append_inventory_root(append, inv)
        serialize_inventory_flat(
            inv, append, self.root_id, self.supported_kinds, working
        )
        if f is not None:
            f.writelines(output)
        # Just to keep the cache from growing without bounds
        # but we may not actually want to clear the cache
        # _clear_cache()
        return output

    def _append_inventory_root(self, append, inv):
        """Append the inventory root to output."""
        if inv.revision_id is not None:
            revid1 = b"".join(
                [b' revision_id="', encode_and_escape(inv.revision_id), b'"']
            )
        else:
            revid1 = b""
        append(b'<inventory format="%s"%s>\n' % (self.format_num, revid1))
        append(
            b'<directory file_id="%s" name="%s" revision="%s" />\n'
            % (
                encode_and_escape(inv.root.file_id),
                encode_and_escape(inv.root.name),
                encode_and_escape(inv.root.revision),
            )
        )

    def _unpack_entry(self, elt, entry_cache=None, return_from_cache=False):
        # This is here because it's overridden by xml7
        return unpack_inventory_entry(elt, entry_cache, return_from_cache)

    def _unpack_inventory(
        self, elt, revision_id=None, entry_cache=None, return_from_cache=False
    ):
        """Construct from XML Element."""
        inv = unpack_inventory_flat(
            elt, self.format_num, self._unpack_entry, entry_cache, return_from_cache
        )
        self._check_cache_size(len(inv), entry_cache)
        return inv

    def _find_text_key_references(self, line_iterator):
        """Core routine for extracting references to texts from inventories.

        This performs the translation of xml lines to revision ids.

        :param line_iterator: An iterator of lines, origin_version_id
        :return: A dictionary mapping text keys ((fileid, revision_id) tuples)
            to whether they were referred to by the inventory of the
            revision_id that they contain. Note that if that revision_id was
            not part of the line_iterator's output then False will be given -
            even though it may actually refer to that key.
""" if not self.support_altered_by_hack: raise AssertionError( "_find_text_key_references only " "supported for branches which store inventory as unnested xml" ", not on {!r}".format(self) ) result = {} # this code needs to read every new line in every inventory for the # inventories [revision_ids]. Seeing a line twice is ok. Seeing a line # not present in one of those inventories is unnecessary but not # harmful because we are filtering by the revision id marker in the # inventory lines : we only select file ids altered in one of those # revisions. We don't need to see all lines in the inventory because # only those added in an inventory in rev X can contain a revision=X # line. unescape_revid_cache = {} unescape_fileid_cache = {} # jam 20061218 In a big fetch, this handles hundreds of thousands # of lines, so it has had a lot of inlining and optimizing done. # Sorry that it is a little bit messy. # Move several functions to be local variables, since this is a long # running loop. search = self._file_ids_altered_regex.search unescape = _unescape_xml setdefault = result.setdefault for line, line_key in line_iterator: match = search(line) if match is None: continue # One call to match.group() returning multiple items is quite a # bit faster than 2 calls to match.group() each returning 1 file_id, revision_id = match.group("file_id", "revision_id") # Inlining the cache lookups helps a lot when you make 170,000 # lines and 350k ids, versus 8.4 unique ids. # Using a cache helps in 2 ways: # 1) Avoids unnecessary decoding calls # 2) Re-uses cached strings, which helps in future set and # equality checks. # (2) is enough that removing encoding entirely along with # the cache (so we are using plain strings) results in no # performance improvement. 
            try:
                revision_id = unescape_revid_cache[revision_id]
            except KeyError:
                unescaped = unescape(revision_id)
                unescape_revid_cache[revision_id] = unescaped
                revision_id = unescaped

            # Note that unconditionally unescaping means that we deserialise
            # every fileid, which for general 'pull' is not great, but we don't
            # really want to have so many fulltexts that this matters anyway.
            # RBC 20071114.
            try:
                file_id = unescape_fileid_cache[file_id]
            except KeyError:
                unescaped = unescape(file_id)
                unescape_fileid_cache[file_id] = unescaped
                file_id = unescaped

            key = (file_id, revision_id)
            setdefault(key, False)
            if revision_id == line_key[-1]:
                result[key] = True
        return result


inventory_serializer_v8 = InventorySerializer_v8()
bzrformats_3.4.0.orig/bzrformats/xml_serializer.py0000644000000000000000000003554315162115103017457 0ustar00
# Copyright (C) 2005-2010 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

"""XML externalization support."""

# "XML is like violence: if it doesn't solve your problem, you aren't
# using enough of it."
-- various # importing this module is fairly slow because it has to load several # ElementTree bits __all__ = [ "Element", "ElementTree", "SubElement", "escape_invalid_chars", "fromstring", "fromstringlist", "get_utf8_or_ascii", "serialize_inventory_flat", "tostring", "tostringlist", "unpack_inventory_entry", "unpack_inventory_flat", ] from xml.etree.ElementTree import ( Element, ElementTree, ParseError, SubElement, fromstring, fromstringlist, tostring, tostringlist, ) from . import inventory, serializer class XMLRevisionSerializer(serializer.RevisionSerializer): """Abstract XML object serialize/deserialize.""" squashes_xml_invalid_characters = True def _unpack_revision(self, element): raise NotImplementedError(self._unpack_revision) def write_revision_to_string(self, rev): """Serialize a revision object to a UTF-8 string.""" return b"".join(self.write_revision_to_lines(rev)) def read_revision(self, f): """Read a revision from an open file object.""" return self._unpack_revision(self._read_element(f)) def read_revision_from_string(self, xml_string): """Read a revision from an XML string.""" return self._unpack_revision(fromstring(xml_string)) # noqa: S314 def _read_element(self, f): return ElementTree().parse(f) class XMLInventorySerializer(serializer.InventorySerializer): """Abstract XML object serialize/deserialize.""" def read_inventory_from_lines( self, lines, revision_id=None, entry_cache=None, return_from_cache=False ): """Read xml_string into an inventory object. :param chunks: The xml to read. :param revision_id: If not-None, the expected revision id of the inventory. Some serialisers use this to set the results' root revision. This should be supplied for deserialising all from-repository inventories so that xml5 inventories that were serialised without a revision identifier can be given the right revision id (but not for working tree inventories where users can edit the data without triggering checksum errors or anything). 
:param entry_cache: An optional cache of InventoryEntry objects. If supplied we will look up entries via (file_id, revision_id) which should map to a valid InventoryEntry (File/Directory/etc) object. :param return_from_cache: Return entries directly from the cache, rather than copying them first. This is only safe if the caller promises not to mutate the returned inventory entries, but it can make some operations significantly faster. """ try: return self._unpack_inventory( fromstringlist(lines), revision_id, entry_cache=entry_cache, return_from_cache=return_from_cache, ) except ParseError as e: raise serializer.UnexpectedInventoryFormat(str(e)) from e def _unpack_inventory( self, element, revision_id: bytes | None = None, entry_cache=None, return_from_cache=False, ): raise NotImplementedError(self._unpack_inventory) def read_inventory(self, f, revision_id=None): """Read an inventory from an open file object.""" try: try: return self._unpack_inventory(self._read_element(f), revision_id=None) finally: f.close() except ParseError as e: raise serializer.UnexpectedInventoryFormat(str(e)) from e def _read_element(self, f): return ElementTree().parse(f) def get_utf8_or_ascii(a_str): """Return a cached version of the string. cElementTree will return a plain string if the XML is plain ascii. It only returns Unicode when it needs to. We want to work in utf-8 strings. So if cElementTree returns a plain string, we can just return the cached version. If it is Unicode, then we need to encode it. :param a_str: An 8-bit string or Unicode as returned by cElementTree.Element.get() :return: A utf-8 encoded 8-bit string. """ # This is fairly optimized because we know what cElementTree does, this is # not meant as a generic function for all cases. Because it is possible for # an 8-bit string to not be ascii or valid utf8. 
if a_str.__class__ is str: return a_str.encode("utf-8") else: return a_str from ._bzr_rs import encode_and_escape, escape_invalid_chars def unpack_inventory_entry( elt, entry_cache=None, return_from_cache=False, root_id=None ): """Unpack an inventory entry from XML element.""" elt_get = elt.get file_id = elt_get("file_id") revision = elt_get("revision") # Check and see if we have already unpacked this exact entry # Some timings for "repo.revision_trees(last_100_revs)" # bzr mysql # unmodified 4.1s 40.8s # using lru 3.5s # using fifo 2.83s 29.1s # lru._cache 2.8s # dict 2.75s 26.8s # inv.add 2.5s 26.0s # no_copy 2.00s 20.5s # no_c,dict 1.95s 18.0s # Note that a cache of 10k nodes is more than sufficient to hold all of # the inventory for the last 100 revs for bzr, but not for mysql (20k # is enough for mysql, which saves the same 2s as using a dict) # Breakdown of mysql using time.clock() # 4.1s 2 calls to element.get for file_id, revision_id # 4.5s cache_hit lookup # 7.1s InventoryFile.copy() # 2.4s InventoryDirectory.copy() # 0.4s decoding unique entries # 1.6s decoding entries after FIFO fills up # 0.8s Adding nodes to FIFO (including flushes) # 0.1s cache miss lookups # Using an LRU cache # 4.1s 2 calls to element.get for file_id, revision_id # 9.9s cache_hit lookup # 10.8s InventoryEntry.copy() # 0.3s cache miss lookus # 1.2s decoding entries # 1.0s adding nodes to LRU if entry_cache is not None and revision is not None: key = (file_id, revision) try: # We copy it, because some operations may mutate it cached_ie = entry_cache[key] except KeyError: pass else: # Only copying directory entries drops us 2.85s => 2.35s if return_from_cache: if cached_ie.kind == "directory": return cached_ie.copy() return cached_ie return cached_ie.copy() kind = elt.tag if not inventory.InventoryEntry.versionable_kind(kind): raise AssertionError(f"unsupported entry kind {kind}") file_id = get_utf8_or_ascii(file_id) if revision is not None: revision = get_utf8_or_ascii(revision) 
parent_id = elt_get("parent_id") parent_id = get_utf8_or_ascii(parent_id) if parent_id is not None else root_id if kind == "directory": ie = inventory.InventoryDirectory(file_id, elt_get("name"), parent_id, revision) elif kind == "file": text_sha1 = elt_get("text_sha1") if text_sha1 is not None: text_sha1 = text_sha1.encode("ascii") executable = elt_get("executable") == "yes" v = elt_get("text_size") text_size = v and int(v) ie = inventory.InventoryFile( file_id, elt_get("name"), parent_id, revision, text_sha1=text_sha1, executable=executable, text_size=text_size, ) elif kind == "symlink": symlink_target = elt_get("symlink_target") ie = inventory.InventoryLink( file_id, elt_get("name"), parent_id, revision, symlink_target=symlink_target ) elif kind == "tree-reference": file_id = get_utf8_or_ascii(elt.attrib["file_id"]) name = elt.attrib["name"] parent_id = get_utf8_or_ascii(elt.attrib["parent_id"]) revision = get_utf8_or_ascii(elt.get("revision")) reference_revision = get_utf8_or_ascii(elt.get("reference_revision")) ie = inventory.TreeReference( file_id, name, parent_id, revision, reference_revision ) else: raise serializer.UnsupportedInventoryKind(kind) if revision is not None and entry_cache is not None: # We cache a copy() because callers like to mutate objects, and # that would cause the item in cache to mutate as well. # This has a small effect on many-inventory performance, because # the majority fraction is spent in cache hits, not misses. entry_cache[key] = ie.copy() return ie def unpack_inventory_flat( elt, format_num, unpack_entry, entry_cache=None, return_from_cache=False ): """Unpack a flat XML inventory. 
    :param elt: XML element for the inventory
    :param format_num: Expected format number
    :param unpack_entry: Function for unpacking inventory entries
    :return: An inventory
    :raise UnexpectedInventoryFormat: When unexpected elements or data are
        encountered
    """
    if elt.tag != "inventory":
        raise serializer.UnexpectedInventoryFormat(f"Root tag is {elt.tag!r}")
    format = elt.get("format")
    # Guard the encode() call so a missing format attribute cannot raise
    # AttributeError when format_num is also None.
    if (format is None and format_num is not None) or (
        format is not None and format.encode() != format_num
    ):
        raise serializer.UnexpectedInventoryFormat(f"Invalid format version {format!r}")
    revision_id = elt.get("revision_id")
    if revision_id is not None:
        revision_id = revision_id.encode("utf-8")
    inv = inventory.Inventory(root_id=None, revision_id=revision_id)
    for e in elt:
        ie = unpack_entry(e, entry_cache, return_from_cache)
        inv.add(ie)
    return inv


def serialize_inventory_flat(inv, append, root_id, supported_kinds, working):
    """Serialize an inventory to a flat XML file.

    :param inv: Inventory to serialize
    :param append: Function for writing a line of output
    :param working: If True skip history data - text_sha1, text_size,
        reference_revision, symlink_target.
    """
    entries = inv.iter_entries()
    # Skip the root
    _root_path, _root_ie = next(entries)
    for _path, ie in entries:
        if ie.parent_id != root_id:
            parent_str = b"".join(
                [b' parent_id="', encode_and_escape(ie.parent_id), b'"']
            )
        else:
            parent_str = b""
        if ie.kind == "file":
            executable = b' executable="yes"' if ie.executable else b""
            if not working:
                append(
                    b'<file%s file_id="%s" name="%s"%s revision="%s" '
                    b'text_sha1="%s" text_size="%d" />\n'
                    % (
                        executable,
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                        encode_and_escape(ie.revision),
                        ie.text_sha1,
                        ie.text_size,
                    )
                )
            else:
                append(
                    b'<file%s file_id="%s" name="%s"%s />\n'
                    % (
                        executable,
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                    )
                )
        elif ie.kind == "directory":
            if not working:
                append(
                    b'<directory file_id="%s" name="%s"%s revision="%s" />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                        encode_and_escape(ie.revision),
                    )
                )
            else:
                append(
                    b'<directory file_id="%s" name="%s"%s />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                    )
                )
        elif ie.kind == "symlink":
            if not working:
                append(
                    b'<symlink file_id="%s" name="%s"%s revision="%s" '
                    b'symlink_target="%s" />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                        encode_and_escape(ie.revision),
                        encode_and_escape(ie.symlink_target),
                    )
                )
            else:
                append(
                    b'<symlink file_id="%s" name="%s"%s />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                    )
                )
        elif ie.kind == "tree-reference":
            if ie.kind not in supported_kinds:
                raise serializer.UnsupportedInventoryKind(ie.kind)
            if not working:
                append(
                    b'<tree-reference file_id="%s" name="%s"%s '
                    b'revision="%s" reference_revision="%s" />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                        encode_and_escape(ie.revision),
                        encode_and_escape(ie.reference_revision),
                    )
                )
            else:
                append(
                    b'<tree-reference file_id="%s" name="%s"%s />\n'
                    % (
                        encode_and_escape(ie.file_id),
                        encode_and_escape(ie.name),
                        parent_str,
                    )
                )
        else:
            raise serializer.UnsupportedInventoryKind(ie.kind)
    append(b"</inventory>\n")
bzrformats_3.4.0.orig/bzrformats/tests/__init__.py0000644000000000000000000004016415162115107017326 0ustar00
"""Test suite for bzrformats package."""

import atexit
import difflib
import logging
import os
import re
import shutil
import sys
import tempfile
import unittest

try:
    import testtools
except ImportError:
    # Minimal
compatibility if testtools is not available testtools = None import importlib from urllib.parse import quote as urlquote from .. import osutils def _try_import(module_name): """Try to import a module, returning it or None if unavailable.""" try: return importlib.import_module(module_name) except ImportError: return None def pathname2url(path): """Convert a local pathname to a URL path.""" # On Unix, pathname2url is essentially identity with encoding of special chars # but preserving '/' return urlquote(path, safe="/:@") logger = logging.getLogger("bzrformats.tests") _unitialized_attr = object() """A sentinel needed to act as a default value in a method signature.""" def _rmtree_temp_dir(path, test_id=None): """Remove a temporary directory, handling errors.""" try: shutil.rmtree(path) except OSError: if test_id: print(f"Failed to remove temp dir {path} for test {test_id}") pass class TestCase(testtools.TestCase if testtools else unittest.TestCase): """Base class for bzrformats unit tests.""" def __init__(self, methodName="testMethod"): # noqa: N803 super().__init__(methodName) self._cleanups = [] def setUp(self): super().setUp() self._orig_cwd = os.getcwd() # Clear config to avoid external config affecting tests # Override HOME to prevent reading user configs import tempfile self._test_home_dir = tempfile.mkdtemp(prefix="brz-test-home-") self.addCleanup(__import__("shutil").rmtree, self._test_home_dir) self.overrideEnv("HOME", self._test_home_dir) self.overrideEnv("BRZ_HOME", self._test_home_dir) self.overrideEnv("EMAIL", "jrandom@example.com") self.overrideEnv("BRZ_EMAIL", None) def tearDown(self): try: # Run any registered cleanup functions while self._cleanups: func, args, kwargs = self._cleanups.pop() func(*args, **kwargs) finally: os.chdir(self._orig_cwd) super().tearDown() def addCleanup(self, func, *args, **kwargs): """Register a function to be called during tearDown.""" self._cleanups.append((func, args, kwargs)) def overrideAttr(self, obj, attr_name, 
new=_unitialized_attr): """Overrides an object attribute restoring it after the test.""" # The actual value is captured by the call below value = getattr(obj, attr_name, _unitialized_attr) if value is _unitialized_attr: # When the test completes, the attribute should not exist, but if # we aren't setting a value, we don't need to do anything. if new is not _unitialized_attr: self.addCleanup(delattr, obj, attr_name) else: self.addCleanup(setattr, obj, attr_name, value) if new is not _unitialized_attr: setattr(obj, attr_name, new) return value def overrideEnv(self, name, new_value): """Override an environment variable, restoring it during tearDown.""" old_value = os.environ.get(name) if new_value is None: if name in os.environ: del os.environ[name] else: os.environ[name] = new_value def restore(): if old_value is None: if name in os.environ: del os.environ[name] else: os.environ[name] = old_value self.addCleanup(restore) def assertEqualDiff(self, a, b, message=None): """Assert two texts are equal, if not raise an exception showing diffs.""" if a == b: return if message is None: message = "texts not equal:\n" if a + "\n" == b: message = "first string is missing a final newline.\n" if a == b + "\n": message = "second string is missing a final newline.\n" # Create a diff diff = difflib.unified_diff( a.splitlines(True), b.splitlines(True), "expected", "actual" ) raise AssertionError(message + "".join(diff)) def assertContainsRe(self, haystack, needle_re, flags=0): """Assert that haystack contains something matching a regular expression.""" if not re.search(needle_re, haystack, flags): raise AssertionError(f'pattern "{needle_re}" not found in "{haystack}"') def assertNotContainsRe(self, haystack, needle_re, flags=0): """Assert that haystack does not match a regular expression.""" if re.search(needle_re, haystack, flags): raise AssertionError(f'pattern "{needle_re}" found in "{haystack}"') def assertStartsWith(self, s, prefix): if not s.startswith(prefix): raise 
AssertionError(f"string {s!r} does not start with {prefix!r}") def assertEndsWith(self, s, suffix): if not s.endswith(suffix): raise AssertionError(f"string {s!r} does not end with {suffix!r}") def assertLength(self, expected_length, obj_with_len): """Assert that obj_with_len is of length expected_length.""" actual_length = len(obj_with_len) if actual_length != expected_length: self.fail( f"Incorrect length: wanted {expected_length}, got {actual_length} for {obj_with_len!r}" ) def assertIs(self, left, right, message=None): """Assert that left is right.""" if left is not right: if message is not None: raise AssertionError(message) else: raise AssertionError(f"{left!r} is not {right!r}.") def assertIsNot(self, left, right, message=None): """Assert that left is not right.""" if left is right: if message is not None: raise AssertionError(message) else: raise AssertionError(f"{left!r} is {right!r}.") def assertIsInstance(self, obj, klass, msg=None): """Assert that obj is an instance of klass.""" if not isinstance(obj, klass): if msg is None: msg = f"{obj!r} is not an instance of {klass}" raise AssertionError(msg) def log(self, *args): """Log a message.""" logger.debug(*args) def assertSubset(self, sublist, superlist): """Assert that every entry in sublist is present in superlist.""" missing = set(sublist) - set(superlist) if missing: raise AssertionError( f"Missing elements {missing!r}: {sublist!r} not a subset of {superlist!r}" ) def knownFailure(self, reason): """Mark test as a known failure.""" raise expectedFailure(reason) def requireFeature(self, feature): """This test requires a specific feature is available. :raises unittest.SkipTest: When feature is not available. 
""" if not feature.available(): self.skipTest(f"Feature {feature.feature_name()} not available") def assertPathExists(self, path): """Fail unless path or paths, which may be abs or relative, exist.""" if not isinstance(path, (bytes, str)): for p in path: if not os.path.exists(p): self.fail(f"path {p} does not exist") else: if not os.path.exists(path): self.fail(f"path {path} does not exist") def assertPathDoesNotExist(self, path): """Fail if path or paths, which may be abs or relative, exist.""" if not isinstance(path, (bytes, str)): for p in path: if os.path.exists(p): self.fail(f"path {p} exists") else: if os.path.exists(path): self.fail(f"path {path} exists") def assertFileEqual(self, content, path): """Fail if path does not contain 'content'.""" self.assertPathExists(path) mode = "r" + ("b" if isinstance(content, bytes) else "") with open(path, mode) as f: s = f.read() self.assertEqualDiff(content, s) def assertListRaises(self, excClass, func, *args, **kwargs): # noqa: N803 """Fail unless excClass is raised when the iterator from func is used. Many functions can return generators this makes sure to wrap them in a list() call to make sure the whole generator is run, and that the proper exception is raised. """ try: list(func(*args, **kwargs)) except excClass as e: return e else: if getattr(excClass, "__name__", None) is not None: excName = excClass.__name__ else: excName = str(excClass) raise self.failureException(f"{excName} not raised") def time(self, callable, *args, **kwargs): """Run callable and return result.""" # Simplified version - just run the callable without profiling return callable(*args, **kwargs) class TestCaseInTempDir(TestCase): """Test case that runs in a temporary directory. This is a minimal version of brz's TestCaseInTempDir. 
""" TEST_ROOT = None def setUp(self): super().setUp() self._make_test_root() self.addCleanup(os.chdir, os.getcwd()) self.makeAndChdirToTestDir() def _make_test_root(self): """Create the top-level test directory if needed.""" if TestCaseInTempDir.TEST_ROOT is None: root = os.path.realpath( tempfile.mkdtemp(prefix="testbzrformats-", suffix=".tmp") ) TestCaseInTempDir.TEST_ROOT = root atexit.register(_rmtree_temp_dir, root) def makeAndChdirToTestDir(self): """Create a temporary directory for this test and chdir to it.""" # Create test directory name based on test id test_name = self.id() if sys.platform in ("win32", "cygwin"): test_name = re.sub('[<>*=+",:;_/\\-]', "_", test_name) test_name = test_name[-30:] # Windows path length limits else: test_name = re.sub("[/]", "_", test_name) base_dir = os.path.join(TestCaseInTempDir.TEST_ROOT, test_name) # Find a unique directory name test_dir = base_dir for i in range(100): if not os.path.exists(test_dir): break test_dir = f"{base_dir}_{i}" else: raise RuntimeError( f"Could not create unique test directory for {test_name}" ) os.makedirs(test_dir) self.test_dir = test_dir self.addCleanup(_rmtree_temp_dir, test_dir, test_id=self.id()) os.chdir(test_dir) def build_tree(self, shape, line_endings="binary", transport=None): """Build a test tree according to a pattern. shape is a sequence of file specifications. If the final character is '/', a directory is created. 
""" for name in shape: if isinstance(name, tuple): name, content = name else: content = None if name.endswith("/"): os.makedirs(name, exist_ok=True) else: dirname = os.path.dirname(name) if dirname: os.makedirs(dirname, exist_ok=True) if content is None: content = f"contents of {name}\n" if isinstance(content, str): if line_endings == "native": content = content.replace("\n", os.linesep) content = content.encode("utf-8") with open(name, "wb") as f: f.write(content) @staticmethod def build_tree_contents(shape): """Build test files with specific contents.""" for entry in shape: if len(entry) == 2: name, content = entry else: name = entry[0] content = None if name.endswith("/"): os.makedirs(name, exist_ok=True) else: dirname = os.path.dirname(name) if dirname: os.makedirs(dirname, exist_ok=True) if content is None: content = b"" if isinstance(content, str): content = content.encode("utf-8") with open(name, "wb") as f: f.write(content) # Import TestSkipped from unittest TestSkipped = unittest.SkipTest class TestNotApplicable(TestSkipped): """Skip a test because it is not applicable to the current configuration.""" pass class TestCaseWithMemoryTransport(TestCase): """TestCase with a MemoryTransport for testing. Uses bzrformats' own MemoryTransport. Each test gets a fresh transport namespace based on the test ID. 
""" def setUp(self): super().setUp() from ..transport import MemoryTransport self._memory_transport = MemoryTransport(url=f"memory:///{self.id()}/") def get_transport(self, relpath=None): """Get the transport for this test case.""" if relpath is None or relpath == ".": return self._memory_transport t = self._memory_transport.clone(relpath) t.ensure_base() return t def get_url(self, relpath=None): """Get a URL for the memory transport.""" if relpath is None or relpath == ".": return self._memory_transport.base return self._memory_transport.abspath(relpath) def check_file_contents(self, filename, expect): """Check contents of a file on the transport.""" contents = self.get_transport().get_bytes(filename) if contents != expect: self.log(f"expected: {expect!r}") self.log(f"actually: {contents!r}") self.fail(f"contents of {filename} not as expected") def load_tests(loader, basic_tests, pattern): """Load tests for bzrformats using the standard unittest discovery mechanism.""" suite = loader.suiteClass() # Add the tests for this module suite.addTests(basic_tests) # List of test modules to load testmod_names = [ "per_inventory", "per_versionedfile", "test__btree_serializer", "test__chk_map", "test__dirstate_helpers", "test__groupcompress", "test_btree_index", "test_chk_map", "test_chk_serializer", "test_chunk_writer", "test_dirstate", "test_generate_ids", "test_groupcompress", "test_hashcache", "test_index", "test_inv", "test_inventory_delta", "test_knit", "test_pack", "test_rio", "test_serializer", "test_tuned_gzip", "test_versionedfile", "test_weave", "test_xml", ] # Load each test module prefix = __name__ + "." 
for testmod_name in testmod_names: suite.addTest(loader.loadTestsFromName(prefix + testmod_name)) # Also load per_* modules per_modules = [ "per_versionedfile", "per_inventory", ] for per_module in per_modules: try: suite.addTest(loader.loadTestsFromName(prefix + per_module)) except (ImportError, AttributeError): # Skip if module doesn't exist or has no tests pass return suite def test_suite(): """Return the test suite for bzrformats (for backwards compatibility).""" loader = unittest.TestLoader() basic_tests = loader.loadTestsFromModule(__import__(__name__, fromlist=[""])) return load_tests(loader, basic_tests, None) def dir_reader_scenarios(): """Simplified dir_reader_scenarios for bzrformats tests.""" # Only use the unicode reader which is always available return [ ( "unicode", { "_dir_reader_class": osutils.UnicodeDirReader, "_native_to_unicode": lambda x: x, # Already unicode }, ) ] bzrformats_3.4.0.orig/bzrformats/tests/per_inventory/0000755000000000000000000000000015162073400020112 5ustar00bzrformats_3.4.0.orig/bzrformats/tests/per_versionedfile.py0000644000000000000000000035540415162115103021275 0ustar00# Copyright (C) 2006-2012, 2016 Canonical Ltd # # Authors: # Johan Rydberg # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # TODO: might be nice to create a versionedfile with some type of corruption # considered typical and check that it can be detected/corrected. import contextlib import itertools from gzip import GzipFile from io import BytesIO from testscenarios import load_tests_apply_scenarios from vcsgraph import known_graph as _mod_known_graph from bzrformats import osutils from bzrformats.errors import ( OutSideTransaction, ReadOnlyError, ReservedId, RevisionAlreadyPresent, ) from .. import groupcompress from .. import knit as _mod_knit from .. import versionedfile as versionedfile from ..errors import RevisionNotPresent from ..knit import cleanup_pack_knit, make_file_factory, make_pack_factory from ..transport import MemoryTransport, TransportNoSuchFile from ..versionedfile import ( ChunkedContentFactory, ConstantMapper, ExistingContent, HashEscapedPrefixMapper, PrefixMapper, UnavailableRepresentation, VirtualVersionedFiles, make_versioned_files_factory, ) from ..weave import WeaveFile, WeaveInvalidChecksum from ..weavefile import write_weave from . import ( TestCase, TestCaseWithMemoryTransport, TestNotApplicable, TestSkipped, ) load_tests = load_tests_apply_scenarios def get_diamond_vf(f, trailing_eol=True, left_only=False): r"""Get a diamond graph to exercise deltas and merges. :param trailing_eol: If True end the last line with \n. """ parents = { b"origin": (), b"base": ((b"origin",),), b"left": ((b"base",),), b"right": ((b"base",),), b"merged": ((b"left",), (b"right",)), } # insert a diamond graph to exercise deltas and merges. 
last_char = b"\n" if trailing_eol else b"" f.add_lines(b"origin", [], [b"origin" + last_char]) f.add_lines(b"base", [b"origin"], [b"base" + last_char]) f.add_lines(b"left", [b"base"], [b"base\n", b"left" + last_char]) if not left_only: f.add_lines(b"right", [b"base"], [b"base\n", b"right" + last_char]) f.add_lines( b"merged", [b"left", b"right"], [b"base\n", b"left\n", b"right\n", b"merged" + last_char], ) return f, parents def get_diamond_files( files, key_length, trailing_eol=True, left_only=False, nograph=False, nokeys=False ): r"""Get a diamond graph to exercise deltas and merges. This creates a 5-node graph in files. If files supports 2-length keys two graphs are made to exercise the support for multiple ids. :param trailing_eol: If True end the last line with \n. :param key_length: The length of keys in files. Currently supports length 1 and 2 keys. :param left_only: If True do not add the right and merged nodes. :param nograph: If True, do not provide parents to the add_lines calls; this is useful for tests that need inserted data but have graphless stores. :param nokeys: If True, pass None as the key for all insertions. Currently implies nograph. :return: The results of the add_lines calls. """ if nokeys: nograph = True prefixes = [()] if key_length == 1 else [(b"FileA",), (b"FileB",)] # insert a diamond graph to exercise deltas and merges. last_char = b"\n" if trailing_eol else b"" result = [] def get_parents(suffix_list): if nograph: return () else: result = [prefix + suffix for suffix in suffix_list] return result def get_key(suffix): if nokeys: return (None,) else: return (suffix,) # we loop over each key because that spreads the inserts across prefixes, # which is how commit operates.
for prefix in prefixes: result.append( files.add_lines(prefix + get_key(b"origin"), (), [b"origin" + last_char]) ) for prefix in prefixes: result.append( files.add_lines( prefix + get_key(b"base"), get_parents([(b"origin",)]), [b"base" + last_char], ) ) for prefix in prefixes: result.append( files.add_lines( prefix + get_key(b"left"), get_parents([(b"base",)]), [b"base\n", b"left" + last_char], ) ) if not left_only: for prefix in prefixes: result.append( files.add_lines( prefix + get_key(b"right"), get_parents([(b"base",)]), [b"base\n", b"right" + last_char], ) ) for prefix in prefixes: result.append( files.add_lines( prefix + get_key(b"merged"), get_parents([(b"left",), (b"right",)]), [b"base\n", b"left\n", b"right\n", b"merged" + last_char], ) ) return result class VersionedFileTestMixIn: """A mixin test class for testing VersionedFiles. This is not an adaptor-style test at this point because there is no dynamic substitution of versioned file implementations; they are strictly controlled by their owning repositories. """ def get_transaction(self): if not hasattr(self, "_transaction"): self._transaction = None return self._transaction def test_add(self): f = self.get_file() f.add_lines(b"r0", [], [b"a\n", b"b\n"]) f.add_lines(b"r1", [b"r0"], [b"b\n", b"c\n"]) def verify_file(f): versions = f.versions() self.assertTrue(b"r0" in versions) self.assertTrue(b"r1" in versions) self.assertEqual(f.get_lines(b"r0"), [b"a\n", b"b\n"]) self.assertEqual(f.get_text(b"r0"), b"a\nb\n") self.assertEqual(f.get_lines(b"r1"), [b"b\n", b"c\n"]) self.assertEqual(2, len(f)) self.assertEqual(2, f.num_versions()) self.assertRaises(RevisionNotPresent, f.add_lines, b"r2", [b"foo"], []) self.assertRaises(RevisionAlreadyPresent, f.add_lines, b"r1", [], []) verify_file(f) # this checks that reopen with create=True does not break anything.
f = self.reopen_file(create=True) verify_file(f) def test_adds_with_parent_texts(self): f = self.get_file() parent_texts = {} _, _, parent_texts[b"r0"] = f.add_lines(b"r0", [], [b"a\n", b"b\n"]) try: _, _, parent_texts[b"r1"] = f.add_lines_with_ghosts( b"r1", [b"r0", b"ghost"], [b"b\n", b"c\n"], parent_texts=parent_texts ) except NotImplementedError: # if the format doesn't support ghosts, just add normally. _, _, parent_texts[b"r1"] = f.add_lines( b"r1", [b"r0"], [b"b\n", b"c\n"], parent_texts=parent_texts ) f.add_lines(b"r2", [b"r1"], [b"c\n", b"d\n"], parent_texts=parent_texts) self.assertNotEqual(None, parent_texts[b"r0"]) self.assertNotEqual(None, parent_texts[b"r1"]) def verify_file(f): versions = f.versions() self.assertTrue(b"r0" in versions) self.assertTrue(b"r1" in versions) self.assertTrue(b"r2" in versions) self.assertEqual(f.get_lines(b"r0"), [b"a\n", b"b\n"]) self.assertEqual(f.get_lines(b"r1"), [b"b\n", b"c\n"]) self.assertEqual(f.get_lines(b"r2"), [b"c\n", b"d\n"]) self.assertEqual(3, f.num_versions()) origins = f.annotate(b"r1") self.assertEqual(origins[0][0], b"r0") self.assertEqual(origins[1][0], b"r1") origins = f.annotate(b"r2") self.assertEqual(origins[0][0], b"r1") self.assertEqual(origins[1][0], b"r2") verify_file(f) f = self.reopen_file() verify_file(f) def test_add_unicode_content(self): # unicode content is not permitted in versioned files. # versioned files version sequences of bytes only. vf = self.get_file() self.assertRaises( TypeError, vf.add_lines, b"a", [], [b"a\n", "b\n", b"c\n"], ) self.assertRaises( (TypeError, NotImplementedError), vf.add_lines_with_ghosts, b"a", [], [b"a\n", "b\n", b"c\n"], ) def test_add_follows_left_matching_blocks(self): """If we change left_matching_blocks, delta changes. Note: There are multiple correct deltas in this case, because we start with 1 "a" and we get 3.
""" vf = self.get_file() if isinstance(vf, WeaveFile): raise TestSkipped("WeaveFile ignores left_matching_blocks") vf.add_lines(b"1", [], [b"a\n"]) vf.add_lines( b"2", [b"1"], [b"a\n", b"a\n", b"a\n"], left_matching_blocks=[(0, 0, 1), (1, 3, 0)], ) self.assertEqual([b"a\n", b"a\n", b"a\n"], vf.get_lines(b"2")) vf.add_lines( b"3", [b"1"], [b"a\n", b"a\n", b"a\n"], left_matching_blocks=[(0, 2, 1), (1, 3, 0)], ) self.assertEqual([b"a\n", b"a\n", b"a\n"], vf.get_lines(b"3")) def test_inline_newline_throws(self): # inline \n characters are not permitted in lines being added vf = self.get_file() self.assertRaises(ValueError, vf.add_lines, b"a", [], [b"a\n\n"]) self.assertRaises( (ValueError, NotImplementedError), vf.add_lines_with_ghosts, b"a", [], [b"a\n\n"], ) # but inline CR's are allowed vf.add_lines(b"a", [], [b"a\r\n"]) with contextlib.suppress(NotImplementedError): vf.add_lines_with_ghosts(b"b", [], [b"a\r\n"]) def test_add_reserved(self): vf = self.get_file() self.assertRaises(ReservedId, vf.add_lines, b"a:", [], [b"a\n", b"b\n", b"c\n"]) def test_add_lines_nostoresha(self): """When nostore_sha is supplied using old content raises.""" vf = self.get_file() empty_text = (b"a", []) sample_text_nl = (b"b", [b"foo\n", b"bar\n"]) sample_text_no_nl = (b"c", [b"foo\n", b"bar"]) shas = [] for version, lines in (empty_text, sample_text_nl, sample_text_no_nl): sha, _, _ = vf.add_lines(version, [], lines) shas.append(sha) # we now have a copy of all the lines in the vf. for sha, (version, lines) in zip( shas, (empty_text, sample_text_nl, sample_text_no_nl), strict=False ): self.assertRaises( ExistingContent, vf.add_lines, version + b"2", [], lines, nostore_sha=sha, ) # and no new version should have been added.
self.assertRaises(RevisionNotPresent, vf.get_lines, version + b"2") def test_add_lines_with_ghosts_nostoresha(self): """When nostore_sha is supplied using old content raises.""" vf = self.get_file() empty_text = (b"a", []) sample_text_nl = (b"b", [b"foo\n", b"bar\n"]) sample_text_no_nl = (b"c", [b"foo\n", b"bar"]) shas = [] for version, lines in (empty_text, sample_text_nl, sample_text_no_nl): sha, _, _ = vf.add_lines(version, [], lines) shas.append(sha) # we now have a copy of all the lines in the vf. # is the test applicable to this vf implementation? try: vf.add_lines_with_ghosts(b"d", [], []) except NotImplementedError as e: raise TestSkipped("add_lines_with_ghosts is optional") from e for sha, (version, lines) in zip( shas, (empty_text, sample_text_nl, sample_text_no_nl), strict=False ): self.assertRaises( ExistingContent, vf.add_lines_with_ghosts, version + b"2", [], lines, nostore_sha=sha, ) # and no new version should have been added. self.assertRaises(RevisionNotPresent, vf.get_lines, version + b"2") def test_add_lines_return_value(self): # add_lines should return the sha1 and the text size. vf = self.get_file() empty_text = (b"a", []) sample_text_nl = (b"b", [b"foo\n", b"bar\n"]) sample_text_no_nl = (b"c", [b"foo\n", b"bar"]) # check results for the three cases: for version, lines in (empty_text, sample_text_nl, sample_text_no_nl): # the first two elements are the same for all versioned files: # - the digest and the size of the text. For some versioned files # additional data is returned in additional tuple elements. 
result = vf.add_lines(version, [], lines) self.assertEqual(3, len(result)) self.assertEqual( (osutils.sha_strings(lines), sum(map(len, lines))), result[0:2] ) # parents should not affect the result: lines = sample_text_nl[1] self.assertEqual( (osutils.sha_strings(lines), sum(map(len, lines))), vf.add_lines(b"d", [b"b", b"c"], lines)[0:2], ) def test_get_reserved(self): vf = self.get_file() self.assertRaises(ReservedId, vf.get_texts, [b"b:"]) self.assertRaises(ReservedId, vf.get_lines, b"b:") self.assertRaises(ReservedId, vf.get_text, b"b:") def test_add_unchanged_last_line_noeol_snapshot(self): """Add a text with an unchanged last line with no eol should work.""" # Test adding this in a number of chain lengths; because the interface # for VersionedFile does not allow forcing a specific chain length, we # just use a small base to get the first snapshot, then a much longer # first line for the next add (which will make the third add snapshot) # and so on. 20 has been chosen as an arbitrary figure - knits use 200 # as a capped delta length, but ideally we would have some way of # tuning the test to the store (e.g. keep going until a snapshot # happens).
for length in range(20): version_lines = {} vf = self.get_file("case-%d" % length) prefix = b"step-%d" parents = [] for step in range(length): version = prefix % step lines = ([b"prelude \n"] * step) + [b"line"] vf.add_lines(version, parents, lines) version_lines[version] = lines parents = [version] vf.add_lines(b"no-eol", parents, [b"line"]) vf.get_texts(version_lines.keys()) self.assertEqualDiff(b"line", vf.get_text(b"no-eol")) def test_get_texts_eol_variation(self): # similar to the failure in vf = self.get_file() sample_text_nl = [b"line\n"] sample_text_no_nl = [b"line"] versions = [] version_lines = {} parents = [] for i in range(4): version = b"v%d" % i lines = sample_text_nl if i % 2 else sample_text_no_nl # left_matching blocks is an internal api; it operates on the # *internal* representation for a knit, which is with *all* lines # being normalised to end with \n - even the final line in a no_nl # file. Using it here ensures that a broken internal implementation # (which is what this test tests) will generate a correct line # delta (which is to say, an empty delta). vf.add_lines(version, parents, lines, left_matching_blocks=[(0, 0, 1)]) parents = [version] versions.append(version) version_lines[version] = lines vf.check() vf.get_texts(versions) vf.get_texts(reversed(versions)) def test_add_lines_with_matching_blocks_noeol_last_line(self): """Add a text with an unchanged last line with no eol should work.""" # Hand verified sha1 of the text we're adding. # Create a mpdiff which adds a new line before the trailing line, and # reuse the last line unaltered (which can cause annotation reuse). 
# Test adding this in two situations: # On top of a new insertion vf = self.get_file("fulltext") vf.add_lines(b"noeol", [], [b"line"]) vf.add_lines( b"noeol2", [b"noeol"], [b"newline\n", b"line"], left_matching_blocks=[(0, 1, 1)], ) self.assertEqualDiff(b"newline\nline", vf.get_text(b"noeol2")) # On top of a delta vf = self.get_file("delta") vf.add_lines(b"base", [], [b"line"]) vf.add_lines(b"noeol", [b"base"], [b"prelude\n", b"line"]) vf.add_lines( b"noeol2", [b"noeol"], [b"newline\n", b"line"], left_matching_blocks=[(1, 1, 1)], ) self.assertEqualDiff(b"newline\nline", vf.get_text(b"noeol2")) def test_make_mpdiffs(self): from .. import multiparent vf = self.get_file("foo") self._setup_for_deltas(vf) new_vf = self.get_file("bar") for version in multiparent.topo_iter(vf): mpdiff = vf.make_mpdiffs([version])[0] new_vf.add_mpdiffs( [ ( version, vf.get_parent_map([version])[version], vf.get_sha1s([version])[version], mpdiff, ) ] ) self.assertEqualDiff(vf.get_text(version), new_vf.get_text(version)) def test_make_mpdiffs_with_ghosts(self): vf = self.get_file("foo") try: vf.add_lines_with_ghosts(b"text", [b"ghost"], [b"line\n"]) except NotImplementedError: # old Weave formats do not allow ghosts return self.assertRaises(RevisionNotPresent, vf.make_mpdiffs, [b"ghost"]) def _setup_for_deltas(self, f): self.assertFalse(f.has_version("base")) # add texts that should trip the knit maximum delta chain threshold # as well as doing parallel chains of data in knits. 
# this is done by two chains of 26 insertions f.add_lines(b"base", [], [b"line\n"]) f.add_lines(b"noeol", [b"base"], [b"line"]) # detailed eol tests: # shared last line with parent no-eol f.add_lines(b"noeolsecond", [b"noeol"], [b"line\n", b"line"]) # differing last line with parent, both no-eol f.add_lines(b"noeolnotshared", [b"noeolsecond"], [b"line\n", b"phone"]) # add eol following a noneol parent, change content f.add_lines(b"eol", [b"noeol"], [b"phone\n"]) # add eol following a noneol parent, no change content f.add_lines(b"eolline", [b"noeol"], [b"line\n"]) # noeol with no parents: f.add_lines(b"noeolbase", [], [b"line"]) # noeol preceding its leftmost parent in the output: # this is done by making it a merge of two parents with no common # ancestry: noeolbase and noeol with the # later-inserted parent the leftmost. f.add_lines(b"eolbeforefirstparent", [b"noeolbase", b"noeol"], [b"line"]) # two identical eol texts f.add_lines(b"noeoldup", [b"noeol"], [b"line"]) next_parent = b"base" text_name = b"chain1-" text = [b"line\n"] sha1s = { 0: b"da6d3141cb4a5e6f464bf6e0518042ddc7bfd079", 1: b"45e21ea146a81ea44a821737acdb4f9791c8abe7", 2: b"e1f11570edf3e2a070052366c582837a4fe4e9fa", 3: b"26b4b8626da827088c514b8f9bbe4ebf181edda1", 4: b"e28a5510be25ba84d31121cff00956f9970ae6f6", 5: b"d63ec0ce22e11dcf65a931b69255d3ac747a318d", 6: b"2c2888d288cb5e1d98009d822fedfe6019c6a4ea", 7: b"95c14da9cafbf828e3e74a6f016d87926ba234ab", 8: b"779e9a0b28f9f832528d4b21e17e168c67697272", 9: b"1f8ff4e5c6ff78ac106fcfe6b1e8cb8740ff9a8f", 10: b"131a2ae712cf51ed62f143e3fbac3d4206c25a05", 11: b"c5a9d6f520d2515e1ec401a8f8a67e6c3c89f199", 12: b"31a2286267f24d8bedaa43355f8ad7129509ea85", 13: b"dc2a7fe80e8ec5cae920973973a8ee28b2da5e0a", 14: b"2c4b1736566b8ca6051e668de68650686a3922f2", 15: b"5912e4ecd9b0c07be4d013e7e2bdcf9323276cde", 16: b"b0d2e18d3559a00580f6b49804c23fea500feab3", 17: b"8e1d43ad72f7562d7cb8f57ee584e20eb1a69fc7", 18: b"5cf64a3459ae28efa60239e44b20312d25b253f3", 19:
b"1ebed371807ba5935958ad0884595126e8c4e823", 20: b"2aa62a8b06fb3b3b892a3292a068ade69d5ee0d3", 21: b"01edc447978004f6e4e962b417a4ae1955b6fe5d", 22: b"d8d8dc49c4bf0bab401e0298bb5ad827768618bb", 23: b"c21f62b1c482862983a8ffb2b0c64b3451876e3f", 24: b"c0593fe795e00dff6b3c0fe857a074364d5f04fc", 25: b"dd1a1cf2ba9cc225c3aff729953e6364bf1d1855", } for depth in range(26): new_version = text_name + b"%d" % depth text = text + [b"line\n"] f.add_lines(new_version, [next_parent], text) next_parent = new_version next_parent = b"base" text_name = b"chain2-" text = [b"line\n"] for depth in range(26): new_version = text_name + b"%d" % depth text = text + [b"line\n"] f.add_lines(new_version, [next_parent], text) next_parent = new_version return sha1s def test_ancestry(self): f = self.get_file() self.assertEqual(set(), f.get_ancestry([])) f.add_lines(b"r0", [], [b"a\n", b"b\n"]) f.add_lines(b"r1", [b"r0"], [b"b\n", b"c\n"]) f.add_lines(b"r2", [b"r0"], [b"b\n", b"c\n"]) f.add_lines(b"r3", [b"r2"], [b"b\n", b"c\n"]) f.add_lines(b"rM", [b"r1", b"r2"], [b"b\n", b"c\n"]) self.assertEqual(set(), f.get_ancestry([])) f.get_ancestry([b"rM"]) self.assertRaises(RevisionNotPresent, f.get_ancestry, [b"rM", b"rX"]) self.assertEqual(set(f.get_ancestry(b"rM")), set(f.get_ancestry(b"rM"))) def test_mutate_after_finish(self): self._transaction = "before" f = self.get_file() self._transaction = "after" self.assertRaises(OutSideTransaction, f.add_lines, b"", [], []) self.assertRaises(OutSideTransaction, f.add_lines_with_ghosts, b"", [], []) def test_copy_to(self): f = self.get_file() f.add_lines(b"0", [], [b"a\n"]) t = MemoryTransport() f.copy_to("foo", t) for suffix in self.get_factory().get_suffixes(): self.assertTrue(t.has("foo" + suffix)) def test_get_suffixes(self): self.get_file() # and should be a list self.assertTrue(isinstance(self.get_factory().get_suffixes(), list)) def test_get_parent_map(self): f = self.get_file() f.add_lines(b"r0", [], [b"a\n", b"b\n"]) self.assertEqual({b"r0": ()}, 
f.get_parent_map([b"r0"])) f.add_lines(b"r1", [b"r0"], [b"a\n", b"b\n"]) self.assertEqual({b"r1": (b"r0",)}, f.get_parent_map([b"r1"])) self.assertEqual({b"r0": (), b"r1": (b"r0",)}, f.get_parent_map([b"r0", b"r1"])) f.add_lines(b"r2", [], [b"a\n", b"b\n"]) f.add_lines(b"r3", [], [b"a\n", b"b\n"]) f.add_lines(b"m", [b"r0", b"r1", b"r2", b"r3"], [b"a\n", b"b\n"]) self.assertEqual({b"m": (b"r0", b"r1", b"r2", b"r3")}, f.get_parent_map([b"m"])) self.assertEqual({}, f.get_parent_map([b"y"])) self.assertEqual( {b"r0": (), b"r1": (b"r0",)}, f.get_parent_map([b"r0", b"y", b"r1"]) ) def test_annotate(self): f = self.get_file() f.add_lines(b"r0", [], [b"a\n", b"b\n"]) f.add_lines(b"r1", [b"r0"], [b"c\n", b"b\n"]) origins = f.annotate(b"r1") self.assertEqual(origins[0][0], b"r1") self.assertEqual(origins[1][0], b"r0") self.assertRaises(RevisionNotPresent, f.annotate, b"foo") def test_detection(self): # Test weaves detect corruption. # # Weaves contain a checksum of their texts. # When a text is extracted, this checksum should be # verified. 
w = self.get_file_corrupted_text() self.assertEqual(b"hello\n", w.get_text(b"v1")) self.assertRaises(WeaveInvalidChecksum, w.get_text, b"v2") self.assertRaises(WeaveInvalidChecksum, w.get_lines, b"v2") self.assertRaises(WeaveInvalidChecksum, w.check) w = self.get_file_corrupted_checksum() self.assertEqual(b"hello\n", w.get_text(b"v1")) self.assertRaises(WeaveInvalidChecksum, w.get_text, b"v2") self.assertRaises(WeaveInvalidChecksum, w.get_lines, b"v2") self.assertRaises(WeaveInvalidChecksum, w.check) def get_file_corrupted_text(self): """Return a versioned file with corrupt text but valid metadata.""" raise NotImplementedError(self.get_file_corrupted_text) def reopen_file(self, name="foo"): """Open the versioned file from disk again.""" raise NotImplementedError(self.reopen_file) def test_iter_lines_added_or_present_in_versions(self): # test that we get at least an equalset of the lines added by # versions in the weave # the ordering here is to make a tree so that dumb searches have # more changes to muck up. class InstrumentedProgress: def __init__(self): self.updates = [] def update(self, msg=None, current=None, total=None): self.updates.append((msg, current, total)) def finished(self): pass vf = self.get_file() # add a base to get included vf.add_lines(b"base", [], [b"base\n"]) # add an ancestor to be included on one side vf.add_lines(b"lancestor", [], [b"lancestor\n"]) # add an ancestor to be included on the other side vf.add_lines(b"rancestor", [b"base"], [b"rancestor\n"]) # add a child of rancestor with no eofile-nl vf.add_lines(b"child", [b"rancestor"], [b"base\n", b"child\n"]) # add a child of lancestor and base to join the two roots vf.add_lines( b"otherchild", [b"lancestor", b"base"], [b"base\n", b"lancestor\n", b"otherchild\n"], ) def iter_with_versions(versions, expected): # now we need to see what lines are returned, and how often.
lines = {} progress = InstrumentedProgress() # iterate over the lines for line in vf.iter_lines_added_or_present_in_versions( versions, pb=progress ): lines.setdefault(line, 0) lines[line] += 1 if progress.updates != []: self.assertEqual(expected, progress.updates) return lines lines = iter_with_versions( [b"child", b"otherchild"], [ ("Walking content", 0, 2), ("Walking content", 1, 2), ("Walking content", 2, 2), ], ) # we must see child and otherchild self.assertTrue(lines[(b"child\n", b"child")] > 0) self.assertTrue(lines[(b"otherchild\n", b"otherchild")] > 0) # we don't care if we got more than that. # test all lines lines = iter_with_versions( None, [ ("Walking content", 0, 5), ("Walking content", 1, 5), ("Walking content", 2, 5), ("Walking content", 3, 5), ("Walking content", 4, 5), ("Walking content", 5, 5), ], ) # all lines must be seen at least once self.assertTrue(lines[(b"base\n", b"base")] > 0) self.assertTrue(lines[(b"lancestor\n", b"lancestor")] > 0) self.assertTrue(lines[(b"rancestor\n", b"rancestor")] > 0) self.assertTrue(lines[(b"child\n", b"child")] > 0) self.assertTrue(lines[(b"otherchild\n", b"otherchild")] > 0) def test_add_lines_with_ghosts(self): # some versioned file formats allow lines to be added with parent # information that is > than that in the format. Formats that do # not support this need to raise NotImplementedError on the # add_lines_with_ghosts api.
vf = self.get_file() # add a revision with ghost parents # The preferred form is utf8, but we should translate when needed parent_id_unicode = "b\xbfse" parent_id_utf8 = parent_id_unicode.encode("utf8") try: vf.add_lines_with_ghosts(b"notbxbfse", [parent_id_utf8], []) except NotImplementedError: # check the other ghost apis are also not implemented self.assertRaises( NotImplementedError, vf.get_ancestry_with_ghosts, [b"foo"] ) self.assertRaises(NotImplementedError, vf.get_parents_with_ghosts, b"foo") return vf = self.reopen_file() # test key graph related apis: get_ancestry, _graph, get_parents # has_version # - these are ghost unaware and must not reflect ghosts self.assertEqual({b"notbxbfse"}, vf.get_ancestry(b"notbxbfse")) self.assertFalse(vf.has_version(parent_id_utf8)) # we have _with_ghost apis to give us ghost information. self.assertEqual( {parent_id_utf8, b"notbxbfse"}, vf.get_ancestry_with_ghosts([b"notbxbfse"]) ) self.assertEqual([parent_id_utf8], vf.get_parents_with_ghosts(b"notbxbfse")) # if we add something that is a ghost of another, it should correct the # results of the prior apis vf.add_lines(parent_id_utf8, [], []) self.assertEqual( {parent_id_utf8, b"notbxbfse"}, vf.get_ancestry([b"notbxbfse"]) ) self.assertEqual( {b"notbxbfse": (parent_id_utf8,)}, vf.get_parent_map([b"notbxbfse"]) ) self.assertTrue(vf.has_version(parent_id_utf8)) # we have _with_ghost apis to give us ghost information. self.assertEqual( {parent_id_utf8, b"notbxbfse"}, vf.get_ancestry_with_ghosts([b"notbxbfse"]) ) self.assertEqual([parent_id_utf8], vf.get_parents_with_ghosts(b"notbxbfse")) def test_add_lines_with_ghosts_after_normal_revs(self): # some versioned file formats allow lines to be added with parent # information that is > than that in the format. Formats that do # not support this need to raise NotImplementedError on the # add_lines_with_ghosts api.
vf = self.get_file() # probe for ghost support try: vf.add_lines_with_ghosts(b"base", [], [b"line\n", b"line_b\n"]) except NotImplementedError: return vf.add_lines_with_ghosts( b"references_ghost", [b"base", b"a_ghost"], [b"line\n", b"line_b\n", b"line_c\n"], ) origins = vf.annotate(b"references_ghost") self.assertEqual((b"base", b"line\n"), origins[0]) self.assertEqual((b"base", b"line_b\n"), origins[1]) self.assertEqual((b"references_ghost", b"line_c\n"), origins[2]) def test_readonly_mode(self): t = self.get_transport() factory = self.get_factory() vf = factory("id", t, 0o777, create=True, access_mode="w") vf = factory("id", t, access_mode="r") self.assertRaises(ReadOnlyError, vf.add_lines, b"base", [], []) self.assertRaises(ReadOnlyError, vf.add_lines_with_ghosts, b"base", [], []) def test_get_sha1s(self): # check the sha1 data is available vf = self.get_file() # a simple file vf.add_lines(b"a", [], [b"a\n"]) # the same file, different metadata vf.add_lines(b"b", [b"a"], [b"a\n"]) # a file differing only in last newline. 
vf.add_lines(b"c", [], [b"a"]) self.assertEqual( { b"a": b"3f786850e387550fdab836ed7e6dc881de23001b", b"c": b"86f7e437faa5a7fce15d1ddcb9eaeaea377667b8", b"b": b"3f786850e387550fdab836ed7e6dc881de23001b", }, vf.get_sha1s([b"a", b"c", b"b"]), ) class TestWeave(TestCaseWithMemoryTransport, VersionedFileTestMixIn): def get_file(self, name="foo"): return WeaveFile( name, self.get_transport(), create=True, get_scope=self.get_transaction ) def get_file_corrupted_text(self): w = WeaveFile( "foo", self.get_transport(), create=True, get_scope=self.get_transaction ) w.add_lines(b"v1", [], [b"hello\n"]) w.add_lines(b"v2", [b"v1"], [b"hello\n", b"there\n"]) # We are going to invasively corrupt the text # Make sure the internals of weave are the same self.assertEqual( [(b"{", 0), b"hello\n", (b"}", None), (b"{", 1), b"there\n", (b"}", None)], w._weave, ) self.assertEqual( [ b"f572d396fae9206628714fb2ce00f72e94f2258f", b"90f265c6e75f1c8f9ab76dcf85528352c5f215ef", ], w._sha1s, ) w.check() # Corrupted w._weave[4] = b"There\n" return w def get_file_corrupted_checksum(self): w = self.get_file_corrupted_text() # Corrected w._weave[4] = b"there\n" self.assertEqual(b"hello\nthere\n", w.get_text(b"v2")) # Invalid checksum, first digit changed w._sha1s[1] = b"f0f265c6e75f1c8f9ab76dcf85528352c5f215ef" return w def reopen_file(self, name="foo", create=False): return WeaveFile( name, self.get_transport(), create=create, get_scope=self.get_transaction ) def test_no_implicit_create(self): self.assertRaises( TransportNoSuchFile, WeaveFile, "foo", self.get_transport(), get_scope=self.get_transaction, ) def get_factory(self): return WeaveFile class TestPlanMergeVersionedFile(TestCaseWithMemoryTransport): def setUp(self): super().setUp() mapper = PrefixMapper() factory = make_file_factory(True, mapper) self.vf1 = factory(self.get_transport("root-1")) self.vf2 = factory(self.get_transport("root-2")) self.plan_merge_vf = versionedfile._PlanMergeVersionedFile("root") 
self.plan_merge_vf.fallback_versionedfiles.extend([self.vf1, self.vf2]) def test_add_lines(self): self.plan_merge_vf.add_lines((b"root", b"a:"), [], []) self.assertRaises( ValueError, self.plan_merge_vf.add_lines, (b"root", b"a"), [], [] ) self.assertRaises( ValueError, self.plan_merge_vf.add_lines, (b"root", b"a:"), None, [] ) self.assertRaises( ValueError, self.plan_merge_vf.add_lines, (b"root", b"a:"), [], None ) def setup_abcde(self): self.vf1.add_lines((b"root", b"A"), [], [b"a"]) self.vf1.add_lines((b"root", b"B"), [(b"root", b"A")], [b"b"]) self.vf2.add_lines((b"root", b"C"), [], [b"c"]) self.vf2.add_lines((b"root", b"D"), [(b"root", b"C")], [b"d"]) self.plan_merge_vf.add_lines( (b"root", b"E:"), [(b"root", b"B"), (b"root", b"D")], [b"e"] ) def test_get_parents(self): self.setup_abcde() self.assertEqual( {(b"root", b"B"): ((b"root", b"A"),)}, self.plan_merge_vf.get_parent_map([(b"root", b"B")]), ) self.assertEqual( {(b"root", b"D"): ((b"root", b"C"),)}, self.plan_merge_vf.get_parent_map([(b"root", b"D")]), ) self.assertEqual( {(b"root", b"E:"): ((b"root", b"B"), (b"root", b"D"))}, self.plan_merge_vf.get_parent_map([(b"root", b"E:")]), ) self.assertEqual({}, self.plan_merge_vf.get_parent_map([(b"root", b"F")])) self.assertEqual( { (b"root", b"B"): ((b"root", b"A"),), (b"root", b"D"): ((b"root", b"C"),), (b"root", b"E:"): ((b"root", b"B"), (b"root", b"D")), }, self.plan_merge_vf.get_parent_map( [(b"root", b"B"), (b"root", b"D"), (b"root", b"E:"), (b"root", b"F")] ), ) def test_get_record_stream(self): self.setup_abcde() def get_record(suffix): return next( self.plan_merge_vf.get_record_stream( [(b"root", suffix)], "unordered", True ) ) self.assertEqual(b"a", get_record(b"A").get_bytes_as("fulltext")) self.assertEqual(b"a", b"".join(get_record(b"A").iter_bytes_as("chunked"))) self.assertEqual(b"c", get_record(b"C").get_bytes_as("fulltext")) self.assertEqual(b"e", get_record(b"E:").get_bytes_as("fulltext")) self.assertEqual("absent", 
get_record(b"F").storage_kind) class MergeCasesMixin: def doMerge(self, base, a, b, mp): def addcrlf(x): return x + b"\n" w = self.get_file() w.add_lines(b"text0", [], list(map(addcrlf, base))) w.add_lines(b"text1", [b"text0"], list(map(addcrlf, a))) w.add_lines(b"text2", [b"text0"], list(map(addcrlf, b))) self.log_contents(w) self.log("merge plan:") p = list(w.plan_merge(b"text1", b"text2")) for state, line in p: if line: self.log("%12s | %s" % (state, line[:-1])) self.log("merge:") mt = BytesIO() mt.writelines(w.weave_merge(p)) mt.seek(0) self.log(mt.getvalue()) mp = list(map(addcrlf, mp)) self.assertEqual(mt.readlines(), mp) def testOneInsert(self): self.doMerge([], [b"aa"], [], [b"aa"]) def testSeparateInserts(self): self.doMerge( [b"aaa", b"bbb", b"ccc"], [b"aaa", b"xxx", b"bbb", b"ccc"], [b"aaa", b"bbb", b"yyy", b"ccc"], [b"aaa", b"xxx", b"bbb", b"yyy", b"ccc"], ) def testSameInsert(self): self.doMerge( [b"aaa", b"bbb", b"ccc"], [b"aaa", b"xxx", b"bbb", b"ccc"], [b"aaa", b"xxx", b"bbb", b"yyy", b"ccc"], [b"aaa", b"xxx", b"bbb", b"yyy", b"ccc"], ) overlapped_insert_expected = [b"aaa", b"xxx", b"yyy", b"bbb"] def testOverlappedInsert(self): self.doMerge( [b"aaa", b"bbb"], [b"aaa", b"xxx", b"yyy", b"bbb"], [b"aaa", b"xxx", b"bbb"], self.overlapped_insert_expected, ) # really it ought to reduce this to # [b'aaa', b'xxx', b'yyy', b'bbb'] def testClashReplace(self): self.doMerge( [b"aaa"], [b"xxx"], [b"yyy", b"zzz"], [b"<<<<<<< ", b"xxx", b"=======", b"yyy", b"zzz", b">>>>>>> "], ) def testNonClashInsert1(self): self.doMerge( [b"aaa"], [b"xxx", b"aaa"], [b"yyy", b"zzz"], [b"<<<<<<< ", b"xxx", b"aaa", b"=======", b"yyy", b"zzz", b">>>>>>> "], ) def testNonClashInsert2(self): self.doMerge([b"aaa"], [b"aaa"], [b"yyy", b"zzz"], [b"yyy", b"zzz"]) def testDeleteAndModify(self): """Clashing delete and modification. If one side modifies a region and the other deletes it then there should be a conflict with one side blank. 
""" ####################################### # skippd, not working yet return self.doMerge( [b"aaa", b"bbb", b"ccc"], [b"aaa", b"ddd", b"ccc"], [b"aaa", b"ccc"], [b"<<<<<<<< ", b"aaa", b"=======", b">>>>>>> ", b"ccc"], ) def _test_merge_from_strings(self, base, a, b, expected): w = self.get_file() w.add_lines(b"text0", [], base.splitlines(True)) w.add_lines(b"text1", [b"text0"], a.splitlines(True)) w.add_lines(b"text2", [b"text0"], b.splitlines(True)) self.log("merge plan:") p = list(w.plan_merge(b"text1", b"text2")) for state, line in p: if line: self.log("%12s | %s" % (state, line[:-1])) self.log("merge result:") result_text = b"".join(w.weave_merge(p)) self.log(result_text) self.assertEqualDiff(result_text, expected) def test_weave_merge_conflicts(self): # does weave merge properly handle plans that end with unchanged? result = b"".join(self.get_file().weave_merge([("new-a", b"hello\n")])) self.assertEqual(result, b"hello\n") def test_deletion_extended(self): """One side deletes, the other deletes more.""" base = b"""\ line 1 line 2 line 3 """ a = b"""\ line 1 line 2 """ b = b"""\ line 1 """ result = b"""\ line 1 <<<<<<<\x20 line 2 ======= >>>>>>>\x20 """ self._test_merge_from_strings(base, a, b, result) def test_deletion_overlap(self): """Delete overlapping regions with no other conflict. Arguably it'd be better to treat these as agreement, rather than conflict, but for now conflict is safer. 
""" base = b"""\ start context int a() {} int b() {} int c() {} end context """ a = b"""\ start context int a() {} end context """ b = b"""\ start context int c() {} end context """ result = b"""\ start context <<<<<<<\x20 int a() {} ======= int c() {} >>>>>>>\x20 end context """ self._test_merge_from_strings(base, a, b, result) def test_agreement_deletion(self): """Agree to delete some lines, without conflicts.""" base = b"""\ start context base line 1 base line 2 end context """ a = b"""\ start context base line 1 end context """ b = b"""\ start context base line 1 end context """ result = b"""\ start context base line 1 end context """ self._test_merge_from_strings(base, a, b, result) def test_sync_on_deletion(self): """Specific case of merge where we can synchronize incorrectly. A previous version of the weave merge concluded that the two versions agreed on deleting line 2, and this could be a synchronization point. Line 1 was then considered in isolation, and thought to be deleted on both sides. It's better to consider the whole thing as a disagreement region. 
""" base = b"""\ start context base line 1 base line 2 end context """ a = b"""\ start context base line 1 a's replacement line 2 end context """ b = b"""\ start context b replaces both lines end context """ result = b"""\ start context <<<<<<<\x20 base line 1 a's replacement line 2 ======= b replaces both lines >>>>>>>\x20 end context """ self._test_merge_from_strings(base, a, b, result) class TestWeaveMerge(TestCaseWithMemoryTransport, MergeCasesMixin): def get_file(self, name="foo"): return WeaveFile(name, self.get_transport(), create=True) def log_contents(self, w): self.log("weave is:") tmpf = BytesIO() write_weave(w, tmpf) self.log(tmpf.getvalue()) overlapped_insert_expected = [ b"aaa", b"<<<<<<< ", b"xxx", b"yyy", b"=======", b"xxx", b">>>>>>> ", b"bbb", ] class TestContentFactoryAdaption(TestCaseWithMemoryTransport): def test_select_adaptor(self): """Test expected adapters exist.""" # One scenario for each lookup combination we expect to use. # Each is source_kind, requested_kind, adapter class scenarios = [ ("knit-delta-gz", "fulltext", _mod_knit.DeltaPlainToFullText), ("knit-delta-gz", "lines", _mod_knit.DeltaPlainToFullText), ("knit-delta-gz", "chunked", _mod_knit.DeltaPlainToFullText), ("knit-ft-gz", "fulltext", _mod_knit.FTPlainToFullText), ("knit-ft-gz", "lines", _mod_knit.FTPlainToFullText), ("knit-ft-gz", "chunked", _mod_knit.FTPlainToFullText), ( "knit-annotated-delta-gz", "knit-delta-gz", _mod_knit.DeltaAnnotatedToUnannotated, ), ("knit-annotated-delta-gz", "fulltext", _mod_knit.DeltaAnnotatedToFullText), ("knit-annotated-ft-gz", "knit-ft-gz", _mod_knit.FTAnnotatedToUnannotated), ("knit-annotated-ft-gz", "fulltext", _mod_knit.FTAnnotatedToFullText), ("knit-annotated-ft-gz", "lines", _mod_knit.FTAnnotatedToFullText), ("knit-annotated-ft-gz", "chunked", _mod_knit.FTAnnotatedToFullText), ] for source, requested, klass in scenarios: adapter_factory = versionedfile.adapter_registry.get((source, requested)) adapter = adapter_factory(None) 
self.assertIsInstance(adapter, klass) def get_knit(self, annotated=True): mapper = ConstantMapper("knit") transport = self.get_transport() return make_file_factory(annotated, mapper)(transport) def helpGetBytes(self, f, ft_name, ft_adapter, delta_name, delta_adapter): """Grab the interested adapted texts for tests.""" # origin is a fulltext entries = f.get_record_stream([(b"origin",)], "unordered", False) base = next(entries) ft_data = ft_adapter.get_bytes(base, ft_name) # merged is both a delta and multiple parents. entries = f.get_record_stream([(b"merged",)], "unordered", False) merged = next(entries) delta_data = delta_adapter.get_bytes(merged, delta_name) return ft_data, delta_data def test_deannotation_noeol(self): """Test converting annotated knits to unannotated knits.""" # we need a full text, and a delta f = self.get_knit() get_diamond_files(f, 1, trailing_eol=False) ft_data, delta_data = self.helpGetBytes( f, "knit-ft-gz", _mod_knit.FTAnnotatedToUnannotated(None), "knit-delta-gz", _mod_knit.DeltaAnnotatedToUnannotated(None), ) self.assertEqual( b"version origin 1 b284f94827db1fa2970d9e2014f080413b547a7e\n" b"origin\n" b"end origin\n", GzipFile(mode="rb", fileobj=BytesIO(ft_data)).read(), ) self.assertEqual( b"version merged 4 32c2e79763b3f90e8ccde37f9710b6629c25a796\n" b"1,2,3\nleft\nright\nmerged\nend merged\n", GzipFile(mode="rb", fileobj=BytesIO(delta_data)).read(), ) def test_deannotation(self): """Test converting annotated knits to unannotated knits.""" # we need a full text, and a delta f = self.get_knit() get_diamond_files(f, 1) ft_data, delta_data = self.helpGetBytes( f, "knit-ft-gz", _mod_knit.FTAnnotatedToUnannotated(None), "knit-delta-gz", _mod_knit.DeltaAnnotatedToUnannotated(None), ) self.assertEqual( b"version origin 1 00e364d235126be43292ab09cb4686cf703ddc17\n" b"origin\n" b"end origin\n", GzipFile(mode="rb", fileobj=BytesIO(ft_data)).read(), ) self.assertEqual( b"version merged 3 ed8bce375198ea62444dc71952b22cfc2b09226d\n" 
b"2,2,2\nright\nmerged\nend merged\n", GzipFile(mode="rb", fileobj=BytesIO(delta_data)).read(), ) def test_annotated_to_fulltext_no_eol(self): """Test adapting annotated knits to full texts (for -> weaves).""" # we need a full text, and a delta f = self.get_knit() get_diamond_files(f, 1, trailing_eol=False) # Reconstructing a full text requires a backing versioned file, and it # must have the base lines requested from it. logged_vf = versionedfile.RecordingVersionedFilesDecorator(f) ft_data, delta_data = self.helpGetBytes( f, "fulltext", _mod_knit.FTAnnotatedToFullText(None), "fulltext", _mod_knit.DeltaAnnotatedToFullText(logged_vf), ) self.assertEqual(b"origin", ft_data) self.assertEqual(b"base\nleft\nright\nmerged", delta_data) self.assertEqual( [("get_record_stream", [(b"left",)], "unordered", True)], logged_vf.calls ) def test_annotated_to_fulltext(self): """Test adapting annotated knits to full texts (for -> weaves).""" # we need a full text, and a delta f = self.get_knit() get_diamond_files(f, 1) # Reconstructing a full text requires a backing versioned file, and it # must have the base lines requested from it. logged_vf = versionedfile.RecordingVersionedFilesDecorator(f) ft_data, delta_data = self.helpGetBytes( f, "fulltext", _mod_knit.FTAnnotatedToFullText(None), "fulltext", _mod_knit.DeltaAnnotatedToFullText(logged_vf), ) self.assertEqual(b"origin\n", ft_data) self.assertEqual(b"base\nleft\nright\nmerged\n", delta_data) self.assertEqual( [("get_record_stream", [(b"left",)], "unordered", True)], logged_vf.calls ) def test_unannotated_to_fulltext(self): """Test adapting unannotated knits to full texts. This is used for -> weaves, and for -> annotated knits. """ # we need a full text, and a delta f = self.get_knit(annotated=False) get_diamond_files(f, 1) # Reconstructing a full text requires a backing versioned file, and it # must have the base lines requested from it. 
logged_vf = versionedfile.RecordingVersionedFilesDecorator(f) ft_data, delta_data = self.helpGetBytes( f, "fulltext", _mod_knit.FTPlainToFullText(None), "fulltext", _mod_knit.DeltaPlainToFullText(logged_vf), ) self.assertEqual(b"origin\n", ft_data) self.assertEqual(b"base\nleft\nright\nmerged\n", delta_data) self.assertEqual( [("get_record_stream", [(b"left",)], "unordered", True)], logged_vf.calls ) def test_unannotated_to_fulltext_no_eol(self): """Test adapting unannotated knits to full texts. This is used for -> weaves, and for -> annotated knits. """ # we need a full text, and a delta f = self.get_knit(annotated=False) get_diamond_files(f, 1, trailing_eol=False) # Reconstructing a full text requires a backing versioned file, and it # must have the base lines requested from it. logged_vf = versionedfile.RecordingVersionedFilesDecorator(f) ft_data, delta_data = self.helpGetBytes( f, "fulltext", _mod_knit.FTPlainToFullText(None), "fulltext", _mod_knit.DeltaPlainToFullText(logged_vf), ) self.assertEqual(b"origin", ft_data) self.assertEqual(b"base\nleft\nright\nmerged", delta_data) self.assertEqual( [("get_record_stream", [(b"left",)], "unordered", True)], logged_vf.calls ) class TestKeyMapper(TestCaseWithMemoryTransport): """Tests for various key mapping logic.""" def test_identity_mapper(self): mapper = versionedfile.ConstantMapper("inventory") self.assertEqual("inventory", mapper.map((b"foo@ar",))) self.assertEqual("inventory", mapper.map((b"quux",))) def test_prefix_mapper(self): # format5: plain mapper = versionedfile.PrefixMapper() self.assertEqual("file-id", mapper.map((b"file-id", b"revision-id"))) self.assertEqual("new-id", mapper.map((b"new-id", b"revision-id"))) self.assertEqual((b"file-id",), mapper.unmap("file-id")) self.assertEqual((b"new-id",), mapper.unmap("new-id")) def test_hash_prefix_mapper(self): # format6: hash + plain mapper = versionedfile.HashPrefixMapper() self.assertEqual("9b/file-id", mapper.map((b"file-id", b"revision-id"))) 
self.assertEqual("45/new-id", mapper.map((b"new-id", b"revision-id"))) self.assertEqual((b"file-id",), mapper.unmap("9b/file-id")) self.assertEqual((b"new-id",), mapper.unmap("45/new-id")) def test_hash_escaped_mapper(self): # knit1: hash + escaped mapper = versionedfile.HashEscapedPrefixMapper() self.assertEqual("88/%2520", mapper.map((b" ", b"revision-id"))) self.assertEqual("ed/fil%2545-%2549d", mapper.map((b"filE-Id", b"revision-id"))) self.assertEqual("88/ne%2557-%2549d", mapper.map((b"neW-Id", b"revision-id"))) self.assertEqual((b"filE-Id",), mapper.unmap("ed/fil%2545-%2549d")) self.assertEqual((b"neW-Id",), mapper.unmap("88/ne%2557-%2549d")) class TestVersionedFiles(TestCaseWithMemoryTransport): """Tests for the multiple-file variant of VersionedFile.""" # We want to be sure of behaviour for: # weaves prefix layout (weave texts) # individually named weaves (weave inventories) # annotated knits - prefix|hash|hash-escape layout, we test the third only # as it is the most complex mapper. 
# individually named knits # individual no-graph knits in packs (signatures) # individual graph knits in packs (inventories) # individual graph nocompression knits in packs (revisions) # plain text knits in packs (texts) len_one_scenarios = [ ( "weave-named", { "cleanup": None, "factory": make_versioned_files_factory( WeaveFile, ConstantMapper("inventory") ), "graph": True, "key_length": 1, "support_partial_insertion": False, }, ), ( "named-knit", { "cleanup": None, "factory": make_file_factory(False, ConstantMapper("revisions")), "graph": True, "key_length": 1, "support_partial_insertion": False, }, ), ( "named-nograph-nodelta-knit-pack", { "cleanup": cleanup_pack_knit, "factory": make_pack_factory(False, False, 1), "graph": False, "key_length": 1, "support_partial_insertion": False, }, ), ( "named-graph-knit-pack", { "cleanup": cleanup_pack_knit, "factory": make_pack_factory(True, True, 1), "graph": True, "key_length": 1, "support_partial_insertion": True, }, ), ( "named-graph-nodelta-knit-pack", { "cleanup": cleanup_pack_knit, "factory": make_pack_factory(True, False, 1), "graph": True, "key_length": 1, "support_partial_insertion": False, }, ), ( "groupcompress-nograph", { "cleanup": groupcompress.cleanup_pack_group, "factory": groupcompress.make_pack_factory(False, False, 1), "graph": False, "key_length": 1, "support_partial_insertion": False, }, ), ] len_two_scenarios = [ ( "weave-prefix", { "cleanup": None, "factory": make_versioned_files_factory(WeaveFile, PrefixMapper()), "graph": True, "key_length": 2, "support_partial_insertion": False, }, ), ( "annotated-knit-escape", { "cleanup": None, "factory": make_file_factory(True, HashEscapedPrefixMapper()), "graph": True, "key_length": 2, "support_partial_insertion": False, }, ), ( "plain-knit-pack", { "cleanup": cleanup_pack_knit, "factory": make_pack_factory(True, True, 2), "graph": True, "key_length": 2, "support_partial_insertion": True, }, ), ( "groupcompress", { "cleanup": groupcompress.cleanup_pack_group, 
"factory": groupcompress.make_pack_factory(True, False, 1), "graph": True, "key_length": 1, "support_partial_insertion": False, }, ), ] scenarios = len_one_scenarios + len_two_scenarios def get_versionedfiles(self, relpath="files"): transport = self.get_transport(relpath) if relpath != ".": transport.mkdir(".") files = self.factory(transport) if self.cleanup is not None: self.addCleanup(self.cleanup, files) return files def get_simple_key(self, suffix): """Return a key for the object under test.""" if self.key_length == 1: return (suffix,) else: return (b"FileA",) + (suffix,) def test_add_fallback_implies_without_fallbacks(self): f = self.get_versionedfiles("files") if getattr(f, "add_fallback_versioned_files", None) is None: raise TestNotApplicable(f"{f.__class__.__name__} doesn't support fallbacks") g = self.get_versionedfiles("fallback") key_a = self.get_simple_key(b"a") g.add_lines(key_a, [], [b"\n"]) f.add_fallback_versioned_files(g) self.assertTrue(key_a in f.get_parent_map([key_a])) self.assertFalse(key_a in f.without_fallbacks().get_parent_map([key_a])) def test_add_lines(self): f = self.get_versionedfiles() key0 = self.get_simple_key(b"r0") key1 = self.get_simple_key(b"r1") self.get_simple_key(b"r2") self.get_simple_key(b"foo") f.add_lines(key0, [], [b"a\n", b"b\n"]) if self.graph: f.add_lines(key1, [key0], [b"b\n", b"c\n"]) else: f.add_lines(key1, [], [b"b\n", b"c\n"]) keys = f.keys() self.assertTrue(key0 in keys) self.assertTrue(key1 in keys) records = [] for record in f.get_record_stream([key0, key1], "unordered", True): records.append((record.key, record.get_bytes_as("fulltext"))) records.sort() self.assertEqual([(key0, b"a\nb\n"), (key1, b"b\nc\n")], records) def test_add_chunks(self): f = self.get_versionedfiles() key0 = self.get_simple_key(b"r0") key1 = self.get_simple_key(b"r1") self.get_simple_key(b"r2") self.get_simple_key(b"foo") def add_chunks(key, parents, chunks): factory = ChunkedContentFactory( key, parents, osutils.sha_strings(chunks), 
chunks ) return f.add_content(factory) add_chunks(key0, [], [b"a", b"\nb\n"]) if self.graph: add_chunks(key1, [key0], [b"b", b"\n", b"c\n"]) else: add_chunks(key1, [], [b"b\n", b"c\n"]) keys = f.keys() self.assertIn(key0, keys) self.assertIn(key1, keys) records = [] for record in f.get_record_stream([key0, key1], "unordered", True): records.append((record.key, record.get_bytes_as("fulltext"))) records.sort() self.assertEqual([(key0, b"a\nb\n"), (key1, b"b\nc\n")], records) def test_annotate(self): files = self.get_versionedfiles() self.get_diamond_files(files) prefix = () if self.key_length == 1 else (b"FileA",) # introduced full text origins = files.annotate(prefix + (b"origin",)) self.assertEqual([(prefix + (b"origin",), b"origin\n")], origins) # a delta origins = files.annotate(prefix + (b"base",)) self.assertEqual([(prefix + (b"base",), b"base\n")], origins) # a merge origins = files.annotate(prefix + (b"merged",)) if self.graph: self.assertEqual( [ (prefix + (b"base",), b"base\n"), (prefix + (b"left",), b"left\n"), (prefix + (b"right",), b"right\n"), (prefix + (b"merged",), b"merged\n"), ], origins, ) else: # Without a graph everything is new. self.assertEqual( [ (prefix + (b"merged",), b"base\n"), (prefix + (b"merged",), b"left\n"), (prefix + (b"merged",), b"right\n"), (prefix + (b"merged",), b"merged\n"), ], origins, ) self.assertRaises( RevisionNotPresent, files.annotate, prefix + (b"missing-key",) ) def test_check_no_parameters(self): self.get_versionedfiles() def test_check_progressbar_parameter(self): """A progress bar can be supplied because check can be a generator.""" class _DummyProgressBar: def update(self, *args): pass def finished(self): pass pb = _DummyProgressBar() files = self.get_versionedfiles() files.check(progress_bar=pb) def test_check_with_keys_becomes_generator(self): files = self.get_versionedfiles() self.get_diamond_files(files) keys = files.keys() entries = files.check(keys=keys) seen = set() # Texts output should be fulltexts. 
self.capture_stream( files, entries, seen.add, files.get_parent_map(keys), require_fulltext=True ) # All texts should be output. self.assertEqual(set(keys), seen) def test_clear_cache(self): files = self.get_versionedfiles() files.clear_cache() def test_construct(self): """Each parameterised test can be constructed on a transport.""" self.get_versionedfiles() def get_diamond_files( self, files, trailing_eol=True, left_only=False, nokeys=False ): return get_diamond_files( files, self.key_length, trailing_eol=trailing_eol, nograph=not self.graph, left_only=left_only, nokeys=nokeys, ) def _add_content_nostoresha(self, add_lines): """When nostore_sha is supplied using old content raises.""" vf = self.get_versionedfiles() empty_text = (b"a", []) sample_text_nl = (b"b", [b"foo\n", b"bar\n"]) sample_text_no_nl = (b"c", [b"foo\n", b"bar"]) shas = [] for version, lines in (empty_text, sample_text_nl, sample_text_no_nl): if add_lines: sha, _, _ = vf.add_lines(self.get_simple_key(version), [], lines) else: sha, _, _ = vf.add_lines(self.get_simple_key(version), [], lines) shas.append(sha) # we now have a copy of all the lines in the vf. for sha, (version, lines) in zip( shas, (empty_text, sample_text_nl, sample_text_no_nl), strict=False ): new_key = self.get_simple_key(version + b"2") self.assertRaises( ExistingContent, vf.add_lines, new_key, [], lines, nostore_sha=sha ) self.assertRaises( ExistingContent, vf.add_lines, new_key, [], lines, nostore_sha=sha ) # and no new version should have been added. record = next(vf.get_record_stream([new_key], "unordered", True)) self.assertEqual("absent", record.storage_kind) def test_add_lines_nostoresha(self): self._add_content_nostoresha(add_lines=True) def test_add_lines_return(self): files = self.get_versionedfiles() # save code by using the stock data insertion helper. adds = self.get_diamond_files(files) results = [] # We can only validate the first 2 elements returned from add_lines. 
for add in adds: self.assertEqual(3, len(add)) results.append(add[:2]) if self.key_length == 1: self.assertEqual( [ (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), ], results, ) elif self.key_length == 2: self.assertEqual( [ (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), ], results, ) def test_add_lines_no_key_generates_chk_key(self): files = self.get_versionedfiles() # save code by using the stock data insertion helper. adds = self.get_diamond_files(files, nokeys=True) results = [] # We can only validate the first 2 elements returned from add_lines. for add in adds: self.assertEqual(3, len(add)) results.append(add[:2]) if self.key_length == 1: self.assertEqual( [ (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), ], results, ) # Check the added items got CHK keys. 
self.assertEqual( { (b"sha1:00e364d235126be43292ab09cb4686cf703ddc17",), (b"sha1:51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44",), (b"sha1:9ef09dfa9d86780bdec9219a22560c6ece8e0ef1",), (b"sha1:a8478686da38e370e32e42e8a0c220e33ee9132f",), (b"sha1:ed8bce375198ea62444dc71952b22cfc2b09226d",), }, files.keys(), ) elif self.key_length == 2: self.assertEqual( [ (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"00e364d235126be43292ab09cb4686cf703ddc17", 7), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44", 5), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"a8478686da38e370e32e42e8a0c220e33ee9132f", 10), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1", 11), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), (b"ed8bce375198ea62444dc71952b22cfc2b09226d", 23), ], results, ) # Check the added items got CHK keys. self.assertEqual( { (b"FileA", b"sha1:00e364d235126be43292ab09cb4686cf703ddc17"), (b"FileA", b"sha1:51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44"), (b"FileA", b"sha1:9ef09dfa9d86780bdec9219a22560c6ece8e0ef1"), (b"FileA", b"sha1:a8478686da38e370e32e42e8a0c220e33ee9132f"), (b"FileA", b"sha1:ed8bce375198ea62444dc71952b22cfc2b09226d"), (b"FileB", b"sha1:00e364d235126be43292ab09cb4686cf703ddc17"), (b"FileB", b"sha1:51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44"), (b"FileB", b"sha1:9ef09dfa9d86780bdec9219a22560c6ece8e0ef1"), (b"FileB", b"sha1:a8478686da38e370e32e42e8a0c220e33ee9132f"), (b"FileB", b"sha1:ed8bce375198ea62444dc71952b22cfc2b09226d"), }, files.keys(), ) def test_empty_lines(self): """Empty files can be stored.""" f = self.get_versionedfiles() key_a = self.get_simple_key(b"a") f.add_lines(key_a, [], []) self.assertEqual( b"", next(f.get_record_stream([key_a], "unordered", True)).get_bytes_as( "fulltext" ), ) key_b = self.get_simple_key(b"b") f.add_lines(key_b, self.get_parents([key_a]), []) self.assertEqual( b"", next(f.get_record_stream([key_b], "unordered", 
True)).get_bytes_as( "fulltext" ), ) def test_newline_only(self): f = self.get_versionedfiles() key_a = self.get_simple_key(b"a") f.add_lines(key_a, [], [b"\n"]) self.assertEqual( b"\n", next(f.get_record_stream([key_a], "unordered", True)).get_bytes_as( "fulltext" ), ) key_b = self.get_simple_key(b"b") f.add_lines(key_b, self.get_parents([key_a]), [b"\n"]) self.assertEqual( b"\n", next(f.get_record_stream([key_b], "unordered", True)).get_bytes_as( "fulltext" ), ) def test_get_known_graph_ancestry(self): f = self.get_versionedfiles() if not self.graph: raise TestNotApplicable("ancestry info only relevant with graph.") key_a = self.get_simple_key(b"a") key_b = self.get_simple_key(b"b") key_c = self.get_simple_key(b"c") # A # |\ # | B # |/ # C f.add_lines(key_a, [], [b"\n"]) f.add_lines(key_b, [key_a], [b"\n"]) f.add_lines(key_c, [key_a, key_b], [b"\n"]) kg = f.get_known_graph_ancestry([key_c]) self.assertIsInstance(kg, _mod_known_graph.KnownGraph) self.assertEqual([key_a, key_b, key_c], list(kg.topo_sort())) def test_known_graph_with_fallbacks(self): f = self.get_versionedfiles("files") if not self.graph: raise TestNotApplicable("ancestry info only relevant with graph.") if getattr(f, "add_fallback_versioned_files", None) is None: raise TestNotApplicable(f"{f.__class__.__name__} doesn't support fallbacks") key_a = self.get_simple_key(b"a") key_b = self.get_simple_key(b"b") key_c = self.get_simple_key(b"c") # A only in fallback # |\ # | B # |/ # C g = self.get_versionedfiles("fallback") g.add_lines(key_a, [], [b"\n"]) f.add_fallback_versioned_files(g) f.add_lines(key_b, [key_a], [b"\n"]) f.add_lines(key_c, [key_a, key_b], [b"\n"]) kg = f.get_known_graph_ancestry([key_c]) self.assertEqual([key_a, key_b, key_c], list(kg.topo_sort())) def test_get_record_stream_empty(self): """An empty stream can be requested without error.""" f = self.get_versionedfiles() entries = f.get_record_stream([], "unordered", False) self.assertEqual([], list(entries)) def 
assertValidStorageKind(self, storage_kind): """Assert that storage_kind is a valid storage_kind.""" self.assertSubset( [storage_kind], [ "mpdiff", "knit-annotated-ft", "knit-annotated-delta", "knit-ft", "knit-delta", "chunked", "fulltext", "knit-annotated-ft-gz", "knit-annotated-delta-gz", "knit-ft-gz", "knit-delta-gz", "knit-delta-closure", "knit-delta-closure-ref", "groupcompress-block", "groupcompress-block-ref", ], ) def capture_stream(self, f, entries, on_seen, parents, require_fulltext=False): """Capture a stream for testing.""" for factory in entries: on_seen(factory.key) self.assertValidStorageKind(factory.storage_kind) if factory.sha1 is not None: self.assertEqual(f.get_sha1s([factory.key])[factory.key], factory.sha1) self.assertEqual(parents[factory.key], factory.parents) self.assertIsInstance(factory.get_bytes_as(factory.storage_kind), bytes) if require_fulltext: factory.get_bytes_as("fulltext") def test_get_record_stream_interface(self): """Each item in a stream has to provide a regular interface.""" files = self.get_versionedfiles() self.get_diamond_files(files) keys, _ = self.get_keys_and_sort_order() parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "unordered", False) seen = set() self.capture_stream(files, entries, seen.add, parent_map) self.assertEqual(set(keys), seen) def get_keys_and_sort_order(self): """Get diamond test keys list, and their sort ordering.""" if self.key_length == 1: keys = [(b"merged",), (b"left",), (b"right",), (b"base",)] sort_order = {(b"merged",): 2, (b"left",): 1, (b"right",): 1, (b"base",): 0} else: keys = [ (b"FileA", b"merged"), (b"FileA", b"left"), (b"FileA", b"right"), (b"FileA", b"base"), (b"FileB", b"merged"), (b"FileB", b"left"), (b"FileB", b"right"), (b"FileB", b"base"), ] sort_order = { (b"FileA", b"merged"): 2, (b"FileA", b"left"): 1, (b"FileA", b"right"): 1, (b"FileA", b"base"): 0, (b"FileB", b"merged"): 2, (b"FileB", b"left"): 1, (b"FileB", b"right"): 1, (b"FileB", b"base"): 0, } 
return keys, sort_order def get_keys_and_groupcompress_sort_order(self): """Get diamond test keys list, and their groupcompress sort ordering.""" if self.key_length == 1: keys = [(b"merged",), (b"left",), (b"right",), (b"base",)] sort_order = {(b"merged",): 0, (b"left",): 1, (b"right",): 1, (b"base",): 2} else: keys = [ (b"FileA", b"merged"), (b"FileA", b"left"), (b"FileA", b"right"), (b"FileA", b"base"), (b"FileB", b"merged"), (b"FileB", b"left"), (b"FileB", b"right"), (b"FileB", b"base"), ] sort_order = { (b"FileA", b"merged"): 0, (b"FileA", b"left"): 1, (b"FileA", b"right"): 1, (b"FileA", b"base"): 2, (b"FileB", b"merged"): 3, (b"FileB", b"left"): 4, (b"FileB", b"right"): 4, (b"FileB", b"base"): 5, } return keys, sort_order def test_get_record_stream_interface_ordered(self): """Each item in a stream has to provide a regular interface.""" files = self.get_versionedfiles() self.get_diamond_files(files) keys, sort_order = self.get_keys_and_sort_order() parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "topological", False) seen = [] self.capture_stream(files, entries, seen.append, parent_map) self.assertStreamOrder(sort_order, seen, keys) def test_get_record_stream_interface_ordered_with_delta_closure(self): """Each item must be accessible as a fulltext.""" files = self.get_versionedfiles() self.get_diamond_files(files) keys, sort_order = self.get_keys_and_sort_order() parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "topological", True) seen = [] for factory in entries: seen.append(factory.key) self.assertValidStorageKind(factory.storage_kind) self.assertSubset( [factory.sha1], [None, files.get_sha1s([factory.key])[factory.key]] ) self.assertEqual(parent_map[factory.key], factory.parents) # self.assertEqual(files.get_text(factory.key), ft_bytes = factory.get_bytes_as("fulltext") self.assertIsInstance(ft_bytes, bytes) chunked_bytes = factory.get_bytes_as("chunked") self.assertEqualDiff(ft_bytes, 
b"".join(chunked_bytes)) chunked_bytes = factory.iter_bytes_as("chunked") self.assertEqualDiff(ft_bytes, b"".join(chunked_bytes)) self.assertStreamOrder(sort_order, seen, keys) def test_get_record_stream_interface_groupcompress(self): """Each item in a stream has to provide a regular interface.""" files = self.get_versionedfiles() self.get_diamond_files(files) keys, sort_order = self.get_keys_and_groupcompress_sort_order() parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "groupcompress", False) seen = [] self.capture_stream(files, entries, seen.append, parent_map) self.assertStreamOrder(sort_order, seen, keys) def assertStreamOrder(self, sort_order, seen, keys): self.assertEqual(len(set(seen)), len(keys)) lows = {(): 0} if self.key_length == 1 else {(b"FileA",): 0, (b"FileB",): 0} if not self.graph: self.assertEqual(set(keys), set(seen)) else: for key in seen: sort_pos = sort_order[key] self.assertTrue( sort_pos >= lows[key[:-1]], f"Out of order in sorted stream: {key!r}, {seen!r}", ) lows[key[:-1]] = sort_pos def test_get_record_stream_unknown_storage_kind_raises(self): """Asking for a storage kind that the stream cannot supply raises.""" files = self.get_versionedfiles() self.get_diamond_files(files) if self.key_length == 1: keys = [(b"merged",), (b"left",), (b"right",), (b"base",)] else: keys = [ (b"FileA", b"merged"), (b"FileA", b"left"), (b"FileA", b"right"), (b"FileA", b"base"), (b"FileB", b"merged"), (b"FileB", b"left"), (b"FileB", b"right"), (b"FileB", b"base"), ] parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "unordered", False) # We track the contents because we should be able to try, fail a # particular kind and then ask for one that works and continue. 
seen = set() for factory in entries: seen.add(factory.key) self.assertValidStorageKind(factory.storage_kind) if factory.sha1 is not None: self.assertEqual( files.get_sha1s([factory.key])[factory.key], factory.sha1 ) self.assertEqual(parent_map[factory.key], factory.parents) # currently no stream emits mpdiff self.assertRaises(UnavailableRepresentation, factory.get_bytes_as, "mpdiff") self.assertIsInstance(factory.get_bytes_as(factory.storage_kind), bytes) self.assertEqual(set(keys), seen) def test_get_record_stream_missing_records_are_absent(self): files = self.get_versionedfiles() self.get_diamond_files(files) if self.key_length == 1: keys = [(b"merged",), (b"left",), (b"right",), (b"absent",), (b"base",)] else: keys = [ (b"FileA", b"merged"), (b"FileA", b"left"), (b"FileA", b"right"), (b"FileA", b"absent"), (b"FileA", b"base"), (b"FileB", b"merged"), (b"FileB", b"left"), (b"FileB", b"right"), (b"FileB", b"absent"), (b"FileB", b"base"), (b"absent", b"absent"), ] parent_map = files.get_parent_map(keys) entries = files.get_record_stream(keys, "unordered", False) self.assertAbsentRecord(files, keys, parent_map, entries) entries = files.get_record_stream(keys, "topological", False) self.assertAbsentRecord(files, keys, parent_map, entries) def assertRecordHasContent(self, record, bytes): """Assert that record has the bytes bytes.""" self.assertEqual(bytes, record.get_bytes_as("fulltext")) self.assertEqual(bytes, b"".join(record.get_bytes_as("chunked"))) def test_get_record_stream_native_formats_are_wire_ready_one_ft(self): files = self.get_versionedfiles() key = self.get_simple_key(b"foo") files.add_lines(key, (), [b"my text\n", b"content"]) stream = files.get_record_stream([key], "unordered", False) record = next(stream) if record.storage_kind in ("chunked", "fulltext"): # chunked and fulltext representations are for direct use not wire # serialisation: check they are able to be used directly. To send # such records over the wire translation will be needed. 
self.assertRecordHasContent(record, b"my text\ncontent") else: bytes = [record.get_bytes_as(record.storage_kind)] network_stream = versionedfile.NetworkRecordStream(bytes).read() source_record = record records = [] for record in network_stream: records.append(record) self.assertEqual(source_record.storage_kind, record.storage_kind) self.assertEqual(source_record.parents, record.parents) self.assertEqual( source_record.get_bytes_as(source_record.storage_kind), record.get_bytes_as(record.storage_kind), ) self.assertEqual(1, len(records)) def assertStreamMetaEqual(self, records, expected, stream): """Assert that streams expected and stream have the same records. :param records: A list to collect the seen records. :return: A generator of the records in stream. """ # We make assertions during copying to catch things early for easier # debugging. This must use the iterating zip() from the future. for record, ref_record in zip(stream, expected, strict=False): records.append(record) self.assertEqual(ref_record.key, record.key) self.assertEqual(ref_record.storage_kind, record.storage_kind) self.assertEqual(ref_record.parents, record.parents) yield record def stream_to_bytes_or_skip_counter(self, skipped_records, full_texts, stream): """Convert a stream to a bytes iterator. :param skipped_records: A list with one element to increment when a record is skipped. :param full_texts: A dict from key->fulltext representation, for checking chunked or fulltext stored records. :param stream: A record_stream. :return: An iterator over the bytes of each record. """ for record in stream: if record.storage_kind in ("chunked", "fulltext"): skipped_records[0] += 1 # check the content is correct for direct use. 
self.assertRecordHasContent(record, full_texts[record.key]) else: yield record.get_bytes_as(record.storage_kind) def test_get_record_stream_native_formats_are_wire_ready_ft_delta(self): files = self.get_versionedfiles() target_files = self.get_versionedfiles("target") key = self.get_simple_key(b"ft") key_delta = self.get_simple_key(b"delta") files.add_lines(key, (), [b"my text\n", b"content"]) delta_parents = (key,) if self.graph else () files.add_lines(key_delta, delta_parents, [b"different\n", b"content\n"]) local = files.get_record_stream([key, key_delta], "unordered", False) ref = files.get_record_stream([key, key_delta], "unordered", False) skipped_records = [0] full_texts = { key: b"my text\ncontent", key_delta: b"different\ncontent\n", } byte_stream = self.stream_to_bytes_or_skip_counter( skipped_records, full_texts, local ) network_stream = versionedfile.NetworkRecordStream(byte_stream).read() records = [] # insert the stream from the network into a versioned files object so we can # check the content was carried across correctly without doing delta # inspection. target_files.insert_record_stream( self.assertStreamMetaEqual(records, ref, network_stream) ) # No duplicates on the wire thank you! self.assertEqual(2, len(records) + skipped_records[0]) if len(records): # if any content was copied it all must have all been. self.assertIdenticalVersionedFile(files, target_files) def test_get_record_stream_native_formats_are_wire_ready_delta(self): # copy a delta over the wire files = self.get_versionedfiles() target_files = self.get_versionedfiles("target") key = self.get_simple_key(b"ft") key_delta = self.get_simple_key(b"delta") files.add_lines(key, (), [b"my text\n", b"content"]) delta_parents = (key,) if self.graph else () files.add_lines(key_delta, delta_parents, [b"different\n", b"content\n"]) # Copy the basis text across so we can reconstruct the delta during # insertion into target. 
target_files.insert_record_stream( files.get_record_stream([key], "unordered", False) ) local = files.get_record_stream([key_delta], "unordered", False) ref = files.get_record_stream([key_delta], "unordered", False) skipped_records = [0] full_texts = { key_delta: b"different\ncontent\n", } byte_stream = self.stream_to_bytes_or_skip_counter( skipped_records, full_texts, local ) network_stream = versionedfile.NetworkRecordStream(byte_stream).read() records = [] # insert the stream from the network into a versioned files object so we can # check the content was carried across correctly without doing delta # inspection during check_stream. target_files.insert_record_stream( self.assertStreamMetaEqual(records, ref, network_stream) ) # No duplicates on the wire thank you! self.assertEqual(1, len(records) + skipped_records[0]) if len(records): # if any content was copied it all must have all been self.assertIdenticalVersionedFile(files, target_files) def test_get_record_stream_wire_ready_delta_closure_included(self): # copy a delta over the wire with the ability to get its full text. files = self.get_versionedfiles() key = self.get_simple_key(b"ft") key_delta = self.get_simple_key(b"delta") files.add_lines(key, (), [b"my text\n", b"content"]) delta_parents = (key,) if self.graph else () files.add_lines(key_delta, delta_parents, [b"different\n", b"content\n"]) local = files.get_record_stream([key_delta], "unordered", True) ref = files.get_record_stream([key_delta], "unordered", True) skipped_records = [0] full_texts = { key_delta: b"different\ncontent\n", } byte_stream = self.stream_to_bytes_or_skip_counter( skipped_records, full_texts, local ) network_stream = versionedfile.NetworkRecordStream(byte_stream).read() records = [] # insert the stream from the network into a versioned files object so we can # check the content was carried across correctly without doing delta # inspection during check_stream. 
        for record in self.assertStreamMetaEqual(records, ref, network_stream):
            # we have to be able to get the full text out:
            self.assertRecordHasContent(record, full_texts[record.key])
        # No duplicates on the wire thank you!
        self.assertEqual(1, len(records) + skipped_records[0])

    def assertAbsentRecord(self, files, keys, parents, entries):
        """Helper for test_get_record_stream_missing_records_are_absent."""
        seen = set()
        for factory in entries:
            seen.add(factory.key)
            if factory.key[-1] == b"absent":
                self.assertEqual("absent", factory.storage_kind)
                self.assertEqual(None, factory.sha1)
                self.assertEqual(None, factory.parents)
            else:
                self.assertValidStorageKind(factory.storage_kind)
                if factory.sha1 is not None:
                    sha1 = files.get_sha1s([factory.key])[factory.key]
                    self.assertEqual(sha1, factory.sha1)
                self.assertEqual(parents[factory.key], factory.parents)
                self.assertIsInstance(factory.get_bytes_as(factory.storage_kind), bytes)
        self.assertEqual(set(keys), seen)

    def test_filter_absent_records(self):
        """Requested missing records can be filtered trivially."""
        files = self.get_versionedfiles()
        self.get_diamond_files(files)
        keys, _ = self.get_keys_and_sort_order()
        parent_map = files.get_parent_map(keys)
        # Add an absent record in the middle of the present keys. (We don't ask
        # for just absent keys to ensure that content before and after the
        # absent keys is still delivered).
present_keys = list(keys) if self.key_length == 1: keys.insert(2, (b"extra",)) else: keys.insert(2, (b"extra", b"extra")) entries = files.get_record_stream(keys, "unordered", False) seen = set() self.capture_stream( files, versionedfile.filter_absent(entries), seen.add, parent_map ) self.assertEqual(set(present_keys), seen) def get_mapper(self): """Get a mapper suitable for the key length of the test interface.""" if self.key_length == 1: return ConstantMapper("source") else: return HashEscapedPrefixMapper() def get_parents(self, parents): """Get parents, taking self.graph into consideration.""" if self.graph: return parents else: return None def test_get_annotator(self): files = self.get_versionedfiles() self.get_diamond_files(files) origin_key = self.get_simple_key(b"origin") base_key = self.get_simple_key(b"base") left_key = self.get_simple_key(b"left") right_key = self.get_simple_key(b"right") merged_key = self.get_simple_key(b"merged") # annotator = files.get_annotator() # introduced full text origins, lines = files.get_annotator().annotate(origin_key) self.assertEqual([(origin_key,)], origins) self.assertEqual([b"origin\n"], lines) # a delta origins, lines = files.get_annotator().annotate(base_key) self.assertEqual([(base_key,)], origins) # a merge origins, lines = files.get_annotator().annotate(merged_key) if self.graph: self.assertEqual( [ (base_key,), (left_key,), (right_key,), (merged_key,), ], origins, ) else: # Without a graph everything is new. 
self.assertEqual( [ (merged_key,), (merged_key,), (merged_key,), (merged_key,), ], origins, ) self.assertRaises( RevisionNotPresent, files.get_annotator().annotate, self.get_simple_key(b"missing-key"), ) def test_get_parent_map(self): files = self.get_versionedfiles() if self.key_length == 1: parent_details = [ ((b"r0",), self.get_parents(())), ((b"r1",), self.get_parents(((b"r0",),))), ((b"r2",), self.get_parents(())), ((b"r3",), self.get_parents(())), ((b"m",), self.get_parents(((b"r0",), (b"r1",), (b"r2",), (b"r3",)))), ] else: parent_details = [ ((b"FileA", b"r0"), self.get_parents(())), ((b"FileA", b"r1"), self.get_parents(((b"FileA", b"r0"),))), ((b"FileA", b"r2"), self.get_parents(())), ((b"FileA", b"r3"), self.get_parents(())), ( (b"FileA", b"m"), self.get_parents( ( (b"FileA", b"r0"), (b"FileA", b"r1"), (b"FileA", b"r2"), (b"FileA", b"r3"), ) ), ), ] for key, parents in parent_details: files.add_lines(key, parents, []) # immediately after adding it should be queryable. self.assertEqual({key: parents}, files.get_parent_map([key])) # We can ask for an empty set self.assertEqual({}, files.get_parent_map([])) # We can ask for many keys all_parents = dict(parent_details) self.assertEqual(all_parents, files.get_parent_map(all_parents.keys())) # Absent keys are just not included in the result. keys = list(all_parents.keys()) if self.key_length == 1: keys.insert(1, (b"missing",)) else: keys.insert(1, (b"missing", b"missing")) # Absent keys are just ignored self.assertEqual(all_parents, files.get_parent_map(keys)) def test_get_sha1s(self): files = self.get_versionedfiles() self.get_diamond_files(files) if self.key_length == 1: keys = [(b"base",), (b"origin",), (b"left",), (b"merged",), (b"right",)] else: # ask for shas from different prefixes. 
            keys = [
                (b"FileA", b"base"),
                (b"FileB", b"origin"),
                (b"FileA", b"left"),
                (b"FileA", b"merged"),
                (b"FileB", b"right"),
            ]
        self.assertEqual(
            {
                keys[0]: b"51c64a6f4fc375daf0d24aafbabe4d91b6f4bb44",
                keys[1]: b"00e364d235126be43292ab09cb4686cf703ddc17",
                keys[2]: b"a8478686da38e370e32e42e8a0c220e33ee9132f",
                keys[3]: b"ed8bce375198ea62444dc71952b22cfc2b09226d",
                keys[4]: b"9ef09dfa9d86780bdec9219a22560c6ece8e0ef1",
            },
            files.get_sha1s(keys),
        )

    def test_insert_record_stream_empty(self):
        """Inserting an empty record stream should work."""
        files = self.get_versionedfiles()
        files.insert_record_stream([])

    def assertIdenticalVersionedFile(self, expected, actual):
        """Assert that expected and actual have the same contents."""
        self.assertEqual(set(actual.keys()), set(expected.keys()))
        actual_parents = actual.get_parent_map(actual.keys())
        if self.graph:
            self.assertEqual(actual_parents, expected.get_parent_map(expected.keys()))
        else:
            for _key, parents in actual_parents.items():
                self.assertEqual(None, parents)
        for key in actual.keys():
            actual_text = next(
                actual.get_record_stream([key], "unordered", True)
            ).get_bytes_as("fulltext")
            expected_text = next(
                expected.get_record_stream([key], "unordered", True)
            ).get_bytes_as("fulltext")
            self.assertEqual(actual_text, expected_text)

    def test_insert_record_stream_fulltexts(self):
        """Any file should accept a stream of fulltexts."""
        files = self.get_versionedfiles()
        mapper = self.get_mapper()
        source_transport = self.get_transport("source")
        source_transport.mkdir(".")
        # weaves always output fulltexts.
        source = make_versioned_files_factory(WeaveFile, mapper)(source_transport)
        # Texts with a trailing EOL; the _noeol variant below covers the
        # trailing_eol=False case.
        self.get_diamond_files(source)
        stream = source.get_record_stream(source.keys(), "topological", False)
        files.insert_record_stream(stream)
        self.assertIdenticalVersionedFile(source, files)

    def test_insert_record_stream_fulltexts_noeol(self):
        """Any file should accept a stream of fulltexts without a trailing EOL."""
        files = self.get_versionedfiles()
        mapper = self.get_mapper()
        source_transport = self.get_transport("source")
        source_transport.mkdir(".")
        # weaves always output fulltexts.
        source = make_versioned_files_factory(WeaveFile, mapper)(source_transport)
        self.get_diamond_files(source, trailing_eol=False)
        stream = source.get_record_stream(source.keys(), "topological", False)
        files.insert_record_stream(stream)
        self.assertIdenticalVersionedFile(source, files)

    def test_insert_record_stream_annotated_knits(self):
        """Any file should accept a stream from annotated knits."""
        files = self.get_versionedfiles()
        mapper = self.get_mapper()
        source_transport = self.get_transport("source")
        source_transport.mkdir(".")
        source = make_file_factory(True, mapper)(source_transport)
        self.get_diamond_files(source)
        stream = source.get_record_stream(source.keys(), "topological", False)
        files.insert_record_stream(stream)
        self.assertIdenticalVersionedFile(source, files)

    def test_insert_record_stream_annotated_knits_noeol(self):
        """Any file should accept a stream from annotated knits without a trailing EOL."""
        files = self.get_versionedfiles()
        mapper = self.get_mapper()
        source_transport = self.get_transport("source")
        source_transport.mkdir(".")
        source = make_file_factory(True, mapper)(source_transport)
        self.get_diamond_files(source, trailing_eol=False)
        stream = source.get_record_stream(source.keys(), "topological", False)
        files.insert_record_stream(stream)
        self.assertIdenticalVersionedFile(source, files)

    def test_insert_record_stream_plain_knits(self):
        """Any file should accept a stream from plain knits."""
        files = self.get_versionedfiles()
        mapper =
self.get_mapper() source_transport = self.get_transport("source") source_transport.mkdir(".") source = make_file_factory(False, mapper)(source_transport) self.get_diamond_files(source) stream = source.get_record_stream(source.keys(), "topological", False) files.insert_record_stream(stream) self.assertIdenticalVersionedFile(source, files) def test_insert_record_stream_plain_knits_noeol(self): """Any file should accept a stream from plain knits.""" files = self.get_versionedfiles() mapper = self.get_mapper() source_transport = self.get_transport("source") source_transport.mkdir(".") source = make_file_factory(False, mapper)(source_transport) self.get_diamond_files(source, trailing_eol=False) stream = source.get_record_stream(source.keys(), "topological", False) files.insert_record_stream(stream) self.assertIdenticalVersionedFile(source, files) def test_insert_record_stream_existing_keys(self): """Inserting keys already in a file should not error.""" files = self.get_versionedfiles() source = self.get_versionedfiles("source") self.get_diamond_files(source) # insert some keys into f. 
self.get_diamond_files(files, left_only=True) stream = source.get_record_stream(source.keys(), "topological", False) files.insert_record_stream(stream) self.assertIdenticalVersionedFile(source, files) def test_insert_record_stream_missing_keys(self): """Inserting a stream with absent keys should raise an error.""" files = self.get_versionedfiles() source = self.get_versionedfiles("source") stream = source.get_record_stream( [(b"missing",) * self.key_length], "topological", False ) self.assertRaises(RevisionNotPresent, files.insert_record_stream, stream) def test_insert_record_stream_out_of_order(self): """An out of order stream can either error or work.""" files = self.get_versionedfiles() source = self.get_versionedfiles("source") self.get_diamond_files(source) if self.key_length == 1: origin_keys = [(b"origin",)] end_keys = [(b"merged",), (b"left",)] start_keys = [(b"right",), (b"base",)] else: origin_keys = [(b"FileA", b"origin"), (b"FileB", b"origin")] end_keys = [ ( b"FileA", b"merged", ), ( b"FileA", b"left", ), ( b"FileB", b"merged", ), ( b"FileB", b"left", ), ] start_keys = [ ( b"FileA", b"right", ), ( b"FileA", b"base", ), ( b"FileB", b"right", ), ( b"FileB", b"base", ), ] origin_entries = source.get_record_stream(origin_keys, "unordered", False) end_entries = source.get_record_stream(end_keys, "topological", False) start_entries = source.get_record_stream(start_keys, "topological", False) entries = itertools.chain(origin_entries, end_entries, start_entries) try: files.insert_record_stream(entries) except RevisionNotPresent: # Must not have corrupted the file. files.check() else: self.assertIdenticalVersionedFile(source, files) def test_insert_record_stream_long_parent_chain_out_of_order(self): """An out of order stream can either error or work.""" if not self.graph: raise TestNotApplicable("ancestry info only relevant with graph.") # Create a reasonably long chain of records based on each other, where # most will be deltas. 
source = self.get_versionedfiles("source") parents = () keys = [] content = [(b"same same %d\n" % n) for n in range(500)] letters = b"abcdefghijklmnopqrstuvwxyz" for i in range(len(letters)): letter = letters[i : i + 1] key = (b"key-" + letter,) if self.key_length == 2: key = (b"prefix",) + key content.append(b"content for " + letter + b"\n") source.add_lines(key, parents, content) keys.append(key) parents = (key,) # Create a stream of these records, excluding the first record that the # rest ultimately depend upon, and insert it into a new vf. streams = [] for key in reversed(keys): streams.append(source.get_record_stream([key], "unordered", False)) deltas = itertools.chain.from_iterable(streams[:-1]) files = self.get_versionedfiles() try: files.insert_record_stream(deltas) except RevisionNotPresent: # Must not have corrupted the file. files.check() else: # Must only report either just the first key as a missing parent, # no key as missing (for nodelta scenarios). missing = set(files.get_missing_compression_parent_keys()) missing.discard(keys[0]) self.assertEqual(set(), missing) def get_knit_delta_source(self): """Get a source that can produce a stream with knit delta records, regardless of this test's scenario. """ mapper = self.get_mapper() source_transport = self.get_transport("source") source_transport.mkdir(".") source = make_file_factory(False, mapper)(source_transport) get_diamond_files( source, self.key_length, trailing_eol=True, nograph=False, left_only=False ) return source def test_insert_record_stream_delta_missing_basis_no_corruption(self): """Insertion where a needed basis is not included notifies the caller of the missing basis. In the meantime a record missing its basis is not added. 
""" source = self.get_knit_delta_source() keys = [self.get_simple_key(b"origin"), self.get_simple_key(b"merged")] entries = source.get_record_stream(keys, "unordered", False) files = self.get_versionedfiles() if self.support_partial_insertion: self.assertEqual([], list(files.get_missing_compression_parent_keys())) files.insert_record_stream(entries) missing_bases = files.get_missing_compression_parent_keys() self.assertEqual({self.get_simple_key(b"left")}, set(missing_bases)) self.assertEqual(set(keys), set(files.get_parent_map(keys))) else: self.assertRaises(RevisionNotPresent, files.insert_record_stream, entries) files.check() def test_insert_record_stream_delta_missing_basis_can_be_added_later(self): """Insertion where a needed basis is not included notifies the caller of the missing basis. That basis can be added in a second insert_record_stream call that does not need to repeat records present in the previous stream. The record(s) that required that basis are fully inserted once their basis is no longer missing. """ if not self.support_partial_insertion: raise TestNotApplicable( "versioned file scenario does not support partial insertion" ) source = self.get_knit_delta_source() entries = source.get_record_stream( [self.get_simple_key(b"origin"), self.get_simple_key(b"merged")], "unordered", False, ) files = self.get_versionedfiles() files.insert_record_stream(entries) missing_bases = files.get_missing_compression_parent_keys() self.assertEqual({self.get_simple_key(b"left")}, set(missing_bases)) # 'merged' is inserted (although a commit of a write group involving # this versionedfiles would fail). merged_key = self.get_simple_key(b"merged") self.assertEqual([merged_key], list(files.get_parent_map([merged_key]).keys())) # Add the full delta closure of the missing records missing_entries = source.get_record_stream(missing_bases, "unordered", True) files.insert_record_stream(missing_entries) # Now 'merged' is fully inserted (and a commit would succeed). 
        self.assertEqual([], list(files.get_missing_compression_parent_keys()))
        self.assertEqual([merged_key], list(files.get_parent_map([merged_key]).keys()))
        files.check()

    def test_iter_lines_added_or_present_in_keys(self):
        # test that we get at least an equal set of the lines added by
        # versions in the store.
        # the ordering here is to make a tree so that dumb searches have
        # more chances to muck up.
        class InstrumentedProgress:
            def __init__(self):
                self.updates = []

            def update(self, msg=None, current=None, total=None):
                self.updates.append((msg, current, total))

            def finished(self):
                pass

        files = self.get_versionedfiles()
        # add a base to get included
        files.add_lines(self.get_simple_key(b"base"), (), [b"base\n"])
        # add an ancestor to be included on one side
        files.add_lines(self.get_simple_key(b"lancestor"), (), [b"lancestor\n"])
        # add an ancestor to be included on the other side
        files.add_lines(
            self.get_simple_key(b"rancestor"),
            self.get_parents([self.get_simple_key(b"base")]),
            [b"rancestor\n"],
        )
        # add a child of rancestor with no eofile-nl
        files.add_lines(
            self.get_simple_key(b"child"),
            self.get_parents([self.get_simple_key(b"rancestor")]),
            [b"base\n", b"child\n"],
        )
        # add a child of lancestor and base to join the two roots
        files.add_lines(
            self.get_simple_key(b"otherchild"),
            self.get_parents(
                [self.get_simple_key(b"lancestor"), self.get_simple_key(b"base")]
            ),
            [b"base\n", b"lancestor\n", b"otherchild\n"],
        )

        def iter_with_keys(keys, expected):
            # now we need to see what lines are returned, and how often.
            lines = {}
            progress = InstrumentedProgress()
            # iterate over the lines
            for line in files.iter_lines_added_or_present_in_keys(keys, pb=progress):
                lines.setdefault(line, 0)
                lines[line] += 1
            if progress.updates != []:
                self.assertEqual(expected, progress.updates)
            return lines

        lines = iter_with_keys(
            [self.get_simple_key(b"child"), self.get_simple_key(b"otherchild")],
            [
                ("Walking content", 0, 2),
                ("Walking content", 1, 2),
                ("Walking content", 2, 2),
            ],
        )
        # we must see child and otherchild
        self.assertTrue(lines[(b"child\n", self.get_simple_key(b"child"))] > 0)
        self.assertTrue(
            lines[(b"otherchild\n", self.get_simple_key(b"otherchild"))] > 0
        )
        # we don't care if we got more than that.

        # test all lines
        lines = iter_with_keys(
            files.keys(),
            [
                ("Walking content", 0, 5),
                ("Walking content", 1, 5),
                ("Walking content", 2, 5),
                ("Walking content", 3, 5),
                ("Walking content", 4, 5),
                ("Walking content", 5, 5),
            ],
        )
        # all lines must be seen at least once
        self.assertTrue(lines[(b"base\n", self.get_simple_key(b"base"))] > 0)
        self.assertTrue(lines[(b"lancestor\n", self.get_simple_key(b"lancestor"))] > 0)
        self.assertTrue(lines[(b"rancestor\n", self.get_simple_key(b"rancestor"))] > 0)
        self.assertTrue(lines[(b"child\n", self.get_simple_key(b"child"))] > 0)
        self.assertTrue(
            lines[(b"otherchild\n", self.get_simple_key(b"otherchild"))] > 0
        )

    def test_make_mpdiffs(self):
        from .. import multiparent

        files = self.get_versionedfiles("source")
        # add texts that should trip the knit maximum delta chain threshold
        # as well as doing parallel chains of data in knits.
        # this is done by two chains of 25 insertions
        files.add_lines(self.get_simple_key(b"base"), [], [b"line\n"])
        files.add_lines(
            self.get_simple_key(b"noeol"),
            self.get_parents([self.get_simple_key(b"base")]),
            [b"line"],
        )
        # detailed eol tests:
        # shared last line with parent no-eol
        files.add_lines(
            self.get_simple_key(b"noeolsecond"),
            self.get_parents([self.get_simple_key(b"noeol")]),
            [b"line\n", b"line"],
        )
        # differing last line with parent, both no-eol
        files.add_lines(
            self.get_simple_key(b"noeolnotshared"),
            self.get_parents([self.get_simple_key(b"noeolsecond")]),
            [b"line\n", b"phone"],
        )
        # add eol following a noeol parent, change content
        files.add_lines(
            self.get_simple_key(b"eol"),
            self.get_parents([self.get_simple_key(b"noeol")]),
            [b"phone\n"],
        )
        # add eol following a noeol parent, no change content
        files.add_lines(
            self.get_simple_key(b"eolline"),
            self.get_parents([self.get_simple_key(b"noeol")]),
            [b"line\n"],
        )
        # noeol with no parents:
        files.add_lines(self.get_simple_key(b"noeolbase"), [], [b"line"])
        # noeol preceding its leftmost parent in the output:
        # this is done by making it a merge of two parents with no common
        # ancestry: noeolbase and noeol, with the
        # later-inserted parent the leftmost.
files.add_lines( self.get_simple_key(b"eolbeforefirstparent"), self.get_parents( [self.get_simple_key(b"noeolbase"), self.get_simple_key(b"noeol")] ), [b"line"], ) # two identical eol texts files.add_lines( self.get_simple_key(b"noeoldup"), self.get_parents([self.get_simple_key(b"noeol")]), [b"line"], ) next_parent = self.get_simple_key(b"base") text_name = b"chain1-" text = [b"line\n"] for depth in range(26): new_version = self.get_simple_key(text_name + b"%d" % depth) text = text + [b"line\n"] files.add_lines(new_version, self.get_parents([next_parent]), text) next_parent = new_version next_parent = self.get_simple_key(b"base") text_name = b"chain2-" text = [b"line\n"] for depth in range(26): new_version = self.get_simple_key(text_name + b"%d" % depth) text = text + [b"line\n"] files.add_lines(new_version, self.get_parents([next_parent]), text) next_parent = new_version target = self.get_versionedfiles("target") for key in multiparent.topo_iter_keys(files, files.keys()): mpdiff = files.make_mpdiffs([key])[0] parents = files.get_parent_map([key])[key] or [] target.add_mpdiffs([(key, parents, files.get_sha1s([key])[key], mpdiff)]) self.assertEqualDiff( next(files.get_record_stream([key], "unordered", True)).get_bytes_as( "fulltext" ), next(target.get_record_stream([key], "unordered", True)).get_bytes_as( "fulltext" ), ) def test_keys(self): # While use is discouraged, versions() is still needed by aspects of # bzr. 
files = self.get_versionedfiles() self.assertEqual(set(), set(files.keys())) key = (b"foo",) if self.key_length == 1 else (b"foo", b"bar") files.add_lines(key, (), []) self.assertEqual({key}, set(files.keys())) class VirtualVersionedFilesTests(TestCase): """Basic tests for the VirtualVersionedFiles implementations.""" def _get_parent_map(self, keys): ret = {} for k in keys: if k in self._parent_map: ret[k] = self._parent_map[k] return ret def setUp(self): super().setUp() self._lines = {} self._parent_map = {} self.texts = VirtualVersionedFiles(self._get_parent_map, self._lines.get) def test_add_lines(self): self.assertRaises(NotImplementedError, self.texts.add_lines, b"foo", [], []) def test_add_mpdiffs(self): self.assertRaises(NotImplementedError, self.texts.add_mpdiffs, []) def test_check_noerrors(self): self.texts.check() def test_insert_record_stream(self): self.assertRaises(NotImplementedError, self.texts.insert_record_stream, []) def test_get_sha1s_nonexistent(self): self.assertEqual({}, self.texts.get_sha1s([(b"NONEXISTENT",)])) def test_get_sha1s(self): self._lines[b"key"] = [b"dataline1", b"dataline2"] self.assertEqual( {(b"key",): osutils.sha_strings(self._lines[b"key"])}, self.texts.get_sha1s([(b"key",)]), ) def test_get_parent_map(self): self._parent_map = {b"G": (b"A", b"B")} self.assertEqual( {(b"G",): ((b"A",), (b"B",))}, self.texts.get_parent_map([(b"G",), (b"L",)]) ) def test_get_record_stream(self): self._lines[b"A"] = [b"FOO", b"BAR"] it = self.texts.get_record_stream([(b"A",)], "unordered", True) record = next(it) self.assertEqual("chunked", record.storage_kind) self.assertEqual(b"FOOBAR", record.get_bytes_as("fulltext")) self.assertEqual([b"FOO", b"BAR"], record.get_bytes_as("chunked")) def test_get_record_stream_absent(self): it = self.texts.get_record_stream([(b"A",)], "unordered", True) record = next(it) self.assertEqual("absent", record.storage_kind) def test_iter_lines_added_or_present_in_keys(self): self._lines[b"A"] = [b"FOO", b"BAR"] 
self._lines[b"B"] = [b"HEY"] self._lines[b"C"] = [b"Alberta"] it = self.texts.iter_lines_added_or_present_in_keys([(b"A",), (b"B",)]) self.assertEqual( sorted([(b"FOO", b"A"), (b"BAR", b"A"), (b"HEY", b"B")]), sorted(it) ) bzrformats_3.4.0.orig/bzrformats/tests/test__btree_serializer.py0000644000000000000000000003064515162115103022316 0ustar00# Copyright (C) 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """Direct tests of the btree serializer extension.""" import binascii import bisect from . 
import TestCase, _try_import _compiled_btreeparser_module = _try_import("bzrformats._btree_serializer_pyx") class TestBtreeSerializer(TestCase): def setUp(self): super().setUp() if _compiled_btreeparser_module is None: self.skipTest("bzrformats._btree_serializer_pyx not available") @property def module(self): return _compiled_btreeparser_module class TestHexAndUnhex(TestBtreeSerializer): def assertHexlify(self, as_binary): self.assertEqual( binascii.hexlify(as_binary), self.module._py_hexlify(as_binary) ) def assertUnhexlify(self, as_hex): ba_unhex = binascii.unhexlify(as_hex) mod_unhex = self.module._py_unhexlify(as_hex) if ba_unhex != mod_unhex: mod_hex = b"" if mod_unhex is None else binascii.hexlify(mod_unhex) self.fail( "_py_unhexlify returned a different answer" f" from binascii:\n {binascii.hexlify(ba_unhex)!r}\n != {mod_hex!r}" ) def assertFailUnhexlify(self, as_hex): # Invalid hex content self.assertIs(None, self.module._py_unhexlify(as_hex)) def test_to_hex(self): raw_bytes = bytes(range(256)) for i in range(0, 240, 20): self.assertHexlify(raw_bytes[i : i + 20]) self.assertHexlify(raw_bytes[240:] + raw_bytes[0:4]) def test_from_hex(self): self.assertUnhexlify(b"0123456789abcdef0123456789abcdef01234567") self.assertUnhexlify(b"123456789abcdef0123456789abcdef012345678") self.assertUnhexlify(b"0123456789ABCDEF0123456789ABCDEF01234567") self.assertUnhexlify(b"123456789ABCDEF0123456789ABCDEF012345678") hex_chars = binascii.hexlify(bytes(range(256))) for i in range(0, 480, 40): self.assertUnhexlify(hex_chars[i : i + 40]) self.assertUnhexlify(hex_chars[480:] + hex_chars[0:8]) def test_from_invalid_hex(self): self.assertFailUnhexlify(b"123456789012345678901234567890123456789X") self.assertFailUnhexlify(b"12345678901234567890123456789012345678X9") def test_bad_argument(self): self.assertRaises(ValueError, self.module._py_unhexlify, "1a") self.assertRaises(ValueError, self.module._py_unhexlify, b"1b") _hex_form = b"123456789012345678901234567890abcdefabcd" class 
Test_KeyToSha1(TestBtreeSerializer):
    def assertKeyToSha1(self, expected, key):
        expected_bin = None if expected is None else binascii.unhexlify(expected)
        actual_sha1 = self.module._py_key_to_sha1(key)
        if expected_bin != actual_sha1:
            # Report the hex form so it can be compared with `expected` directly.
            actual_hex_sha1 = None
            if actual_sha1 is not None:
                actual_hex_sha1 = binascii.hexlify(actual_sha1)
            self.fail(f"_key_to_sha1 returned:\n {actual_hex_sha1}\n != {expected}")

    def test_simple(self):
        self.assertKeyToSha1(_hex_form, (b"sha1:" + _hex_form,))

    def test_invalid_not_tuple(self):
        self.assertKeyToSha1(None, _hex_form)
        self.assertKeyToSha1(None, b"sha1:" + _hex_form)

    def test_invalid_empty(self):
        self.assertKeyToSha1(None, ())

    def test_invalid_not_string(self):
        self.assertKeyToSha1(None, (None,))
        self.assertKeyToSha1(None, (list(_hex_form),))

    def test_invalid_not_sha1(self):
        self.assertKeyToSha1(None, (_hex_form,))
        self.assertKeyToSha1(None, (b"sha2:" + _hex_form,))

    def test_invalid_not_hex(self):
        self.assertKeyToSha1(None, (b"sha1:abcdefghijklmnopqrstuvwxyz12345678901234",))


class Test_Sha1ToKey(TestBtreeSerializer):
    def assertSha1ToKey(self, hex_sha1):
        bin_sha1 = binascii.unhexlify(hex_sha1)
        key = self.module._py_sha1_to_key(bin_sha1)
        self.assertEqual((b"sha1:" + hex_sha1,), key)

    def test_simple(self):
        self.assertSha1ToKey(_hex_form)


_one_key_content = b"""type=leaf
sha1:123456789012345678901234567890abcdefabcd\x00\x001 2 3 4
"""

_large_offsets = b"""type=leaf
sha1:123456789012345678901234567890abcdefabcd\x00\x0012345678901 1234567890 0 1
sha1:abcd123456789012345678901234567890abcdef\x00\x002147483648 2147483647 0 1
sha1:abcdefabcd123456789012345678901234567890\x00\x004294967296 4294967295 4294967294 1
"""

_multi_key_content = b"""type=leaf
sha1:c80c881d4a26984ddce795f6f71817c9cf4480e7\x00\x000 0 0 0
sha1:c86f7e437faa5a7fce15d1ddcb9eaeaea377667b\x00\x001 1 1 1
sha1:c8e240de74fb1ed08fa08d38063f6a6a91462a81\x00\x002 2 2 2
sha1:cda39a3ee5e6b4b0d3255bfef95601890afd8070\x00\x003 3 3 3
sha1:cdf51e37c269aa94d38f93e537bf6e2020b21406\x00\x004 4 4 4
sha1:ce0c9035898dd52fc65c41454cec9c4d2611bfb3\x00\x005 5 5 5 sha1:ce93b4e3c464ffd51732fbd6ded717e9efda28aa\x00\x006 6 6 6 sha1:cf7a9e24777ec23212c54d7a350bc5bea5477fdb\x00\x007 7 7 7 """ _multi_key_same_offset = b"""type=leaf sha1:080c881d4a26984ddce795f6f71817c9cf4480e7\x00\x000 0 0 0 sha1:c86f7e437faa5a7fce15d1ddcb9eaeaea377667b\x00\x001 1 1 1 sha1:cd0c9035898dd52fc65c41454cec9c4d2611bfb3\x00\x002 2 2 2 sha1:cda39a3ee5e6b4b0d3255bfef95601890afd8070\x00\x003 3 3 3 sha1:cde240de74fb1ed08fa08d38063f6a6a91462a81\x00\x004 4 4 4 sha1:cdf51e37c269aa94d38f93e537bf6e2020b21406\x00\x005 5 5 5 sha1:ce7a9e24777ec23212c54d7a350bc5bea5477fdb\x00\x006 6 6 6 sha1:ce93b4e3c464ffd51732fbd6ded717e9efda28aa\x00\x007 7 7 7 """ _common_32_bits = b"""type=leaf sha1:123456784a26984ddce795f6f71817c9cf4480e7\x00\x000 0 0 0 sha1:1234567874fb1ed08fa08d38063f6a6a91462a81\x00\x001 1 1 1 sha1:12345678777ec23212c54d7a350bc5bea5477fdb\x00\x002 2 2 2 sha1:123456787faa5a7fce15d1ddcb9eaeaea377667b\x00\x003 3 3 3 sha1:12345678898dd52fc65c41454cec9c4d2611bfb3\x00\x004 4 4 4 sha1:12345678c269aa94d38f93e537bf6e2020b21406\x00\x005 5 5 5 sha1:12345678c464ffd51732fbd6ded717e9efda28aa\x00\x006 6 6 6 sha1:12345678e5e6b4b0d3255bfef95601890afd8070\x00\x007 7 7 7 """ class TestGCCKHSHA1LeafNode(TestBtreeSerializer): def assertInvalid(self, data): """Ensure that we get a proper error when trying to parse invalid bytes. 
(mostly this is testing that bad input doesn't cause us to segfault) """ self.assertRaises( (ValueError, TypeError), self.module._parse_into_chk, data, 1, 0 ) def test_non_bytes(self): self.assertInvalid("type=leaf\n") def test_not_leaf(self): self.assertInvalid(b"type=internal\n") def test_empty_leaf(self): leaf = self.module._parse_into_chk(b"type=leaf\n", 1, 0) self.assertEqual(0, len(leaf)) self.assertEqual([], leaf.all_items()) self.assertEqual([], leaf.all_keys()) # It should allow any key to be queried self.assertNotIn(("key",), leaf) def test_one_key_leaf(self): leaf = self.module._parse_into_chk(_one_key_content, 1, 0) self.assertEqual(1, len(leaf)) sha_key = (b"sha1:" + _hex_form,) self.assertEqual([sha_key], leaf.all_keys()) self.assertEqual([(sha_key, (b"1 2 3 4", ()))], leaf.all_items()) self.assertIn(sha_key, leaf) def test_large_offsets(self): leaf = self.module._parse_into_chk(_large_offsets, 1, 0) self.assertEqual( [ b"12345678901 1234567890 0 1", b"2147483648 2147483647 0 1", b"4294967296 4294967295 4294967294 1", ], [x[1][0] for x in leaf.all_items()], ) def test_many_key_leaf(self): leaf = self.module._parse_into_chk(_multi_key_content, 1, 0) self.assertEqual(8, len(leaf)) all_keys = leaf.all_keys() self.assertEqual(8, len(leaf.all_keys())) for idx, key in enumerate(all_keys): self.assertEqual(b"%d" % idx, leaf[key][0].split()[0]) def test_common_shift(self): # The keys were deliberately chosen so that the first 5 bits all # overlapped, it also happens that a later bit overlaps # Note that by 'overlap' we mean that given bit is either on in all # keys, or off in all keys leaf = self.module._parse_into_chk(_multi_key_content, 1, 0) self.assertEqual(19, leaf.common_shift) # The interesting byte for each key is # (defined as the 8-bits that come after the common prefix) lst = [1, 13, 28, 180, 190, 193, 210, 239] offsets = leaf._get_offsets() self.assertEqual([bisect.bisect_left(lst, x) for x in range(0, 257)], offsets) for idx, val in 
enumerate(lst): self.assertEqual(idx, offsets[val]) for idx, key in enumerate(leaf.all_keys()): self.assertEqual(b"%d" % idx, leaf[key][0].split()[0]) def test_multi_key_same_offset(self): # there is no common prefix, though there are some common bits leaf = self.module._parse_into_chk(_multi_key_same_offset, 1, 0) self.assertEqual(24, leaf.common_shift) offsets = leaf._get_offsets() # The interesting byte is just the first 8 bits of the key lst = [8, 200, 205, 205, 205, 205, 206, 206] self.assertEqual([bisect.bisect_left(lst, x) for x in range(0, 257)], offsets) for val in lst: self.assertEqual(lst.index(val), offsets[val]) for idx, key in enumerate(leaf.all_keys()): self.assertEqual(b"%d" % idx, leaf[key][0].split()[0]) def test_all_common_prefix(self): # The first 32 bits of all hashes are the same. This is going to be # pretty much impossible, but I don't want to fail because of this leaf = self.module._parse_into_chk(_common_32_bits, 1, 0) self.assertEqual(0, leaf.common_shift) lst = [0x78] * 8 offsets = leaf._get_offsets() self.assertEqual([bisect.bisect_left(lst, x) for x in range(0, 257)], offsets) for val in lst: self.assertEqual(lst.index(val), offsets[val]) for idx, key in enumerate(leaf.all_keys()): self.assertEqual(b"%d" % idx, leaf[key][0].split()[0]) def test_many_entries(self): # Again, this is almost impossible, but we should still work # It would be hard to fit more than 120 entries in a 4k page, much less # more than 256 of them.
but hey, weird stuff happens sometimes lines = [b"type=leaf\n"] for i in range(500): key_str = b"sha1:%04x%s" % (i, _hex_form[:36]) key = (key_str,) lines.append(b"%s\0\0%d %d %d %d\n" % (key_str, i, i, i, i)) data = b"".join(lines) leaf = self.module._parse_into_chk(data, 1, 0) self.assertEqual(24 - 7, leaf.common_shift) offsets = leaf._get_offsets() # This is the interesting bits for each entry lst = [x // 2 for x in range(500)] expected_offsets = [x * 2 for x in range(128)] + [255] * 129 self.assertEqual(expected_offsets, offsets) # We truncate because offsets is an unsigned char. So the bisection # will just say 'greater than the last one' for all the rest lst = lst[:255] self.assertEqual([bisect.bisect_left(lst, x) for x in range(0, 257)], offsets) for val in lst: self.assertEqual(lst.index(val), offsets[val]) for idx, key in enumerate(leaf.all_keys()): self.assertEqual(b"%d" % idx, leaf[key][0].split()[0]) def test__sizeof__(self): # We can't use the exact numbers because of platform variations, etc. # But what we really care about is that it does get bigger with more # content. leaf0 = self.module._parse_into_chk(b"type=leaf\n", 1, 0) leaf1 = self.module._parse_into_chk(_one_key_content, 1, 0) leafN = self.module._parse_into_chk(_multi_key_content, 1, 0) sizeof_1 = leaf1.__sizeof__() - leaf0.__sizeof__() self.assertGreater(sizeof_1, 0) sizeof_N = leafN.__sizeof__() - leaf0.__sizeof__() self.assertEqual(sizeof_1 * len(leafN), sizeof_N) bzrformats_3.4.0.orig/bzrformats/tests/test__chk_map.py0000644000000000000000000000141515162073400020362 0ustar00# Copyright (C) 2009, 2010, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for _chk_map_*.""" bzrformats_3.4.0.orig/bzrformats/tests/test__dirstate_helpers.py0000644000000000000000000004441415162115103022324 0ustar00# Copyright (C) 2007-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for the compiled dirstate helpers.""" import bisect import os from testscenarios import load_tests_apply_scenarios from .. import _dirstate_helpers_py, dirstate from . import TestCase load_tests = load_tests_apply_scenarios try: from .. import _dirstate_helpers_pyx as compiled_dirstate_helpers except ImportError: compiled_dirstate_helpers = None helper_scenarios = [("dirstate_Python", {"helpers": _dirstate_helpers_py})] if compiled_dirstate_helpers is not None: helper_scenarios.append(("dirstate_Pyrex", {"helpers": compiled_dirstate_helpers})) class TestBisectPathMixin: """Test that _bisect_path_*() returns the expected values. 
_bisect_path_* is intended to work like bisect.bisect_*() except it knows it is working on paths that are sorted by ('path', 'to', 'foo') chunks rather than by raw 'path/to/foo'. Test Cases should inherit from this and override ``get_bisect_path`` to return their implementation, and ``get_bisect`` to return the matching bisect.bisect_* function. """ def get_bisect_path(self): """Return an implementation of _bisect_path_*.""" raise NotImplementedError def get_bisect(self): """Return a version of bisect.bisect_*. Also, for the 'exists' check, return the offset to the real values. For example bisect_left returns the index of an entry, while bisect_right returns the index *after* an entry. :return: (bisect_func, offset) """ raise NotImplementedError def assertBisect(self, paths, split_paths, path, exists=True): """Assert that bisect_split works like bisect_left on the split paths. :param paths: A list of path names :param split_paths: A list of path names that are already split up by directory ('path/to/foo' => ('path', 'to', 'foo')) :param path: The path we are indexing. :param exists: The path should be present, so make sure the final location actually points to the right value. All other arguments will be passed along. """ bisect_path = self.get_bisect_path() self.assertIsInstance(paths, list) bisect_path_idx = bisect_path(paths, path) split_path = self.split_for_dirblocks([path])[0] bisect_func, offset = self.get_bisect() bisect_split_idx = bisect_func(split_paths, split_path) self.assertEqual( bisect_split_idx, bisect_path_idx, "{} disagreed. 
{} != {} for key {!r}".format( bisect_path.__name__, bisect_split_idx, bisect_path_idx, path ), ) if exists: self.assertEqual(path, paths[bisect_path_idx + offset]) def split_for_dirblocks(self, paths): dir_split_paths = [] for path in paths: dirname, basename = os.path.split(path) dir_split_paths.append((dirname.split(b"/"), basename)) dir_split_paths.sort() return dir_split_paths def test_simple(self): """In the simple case it works just like bisect_left.""" paths = [b"", b"a", b"b", b"c", b"d"] split_paths = self.split_for_dirblocks(paths) for path in paths: self.assertBisect(paths, split_paths, path, exists=True) self.assertBisect(paths, split_paths, b"_", exists=False) self.assertBisect(paths, split_paths, b"aa", exists=False) self.assertBisect(paths, split_paths, b"bb", exists=False) self.assertBisect(paths, split_paths, b"cc", exists=False) self.assertBisect(paths, split_paths, b"dd", exists=False) self.assertBisect(paths, split_paths, b"a/a", exists=False) self.assertBisect(paths, split_paths, b"b/b", exists=False) self.assertBisect(paths, split_paths, b"c/c", exists=False) self.assertBisect(paths, split_paths, b"d/d", exists=False) def test_involved(self): """This is where bisect_path_* diverges slightly.""" # This is the list of paths and their contents # a/ # a/ # a # z # a-a/ # a # a-z/ # z # a=a/ # a # a=z/ # z # z/ # a # z # z-a # z-z # z=a # z=z # a-a/ # a # a-z/ # z # a=a/ # a # a=z/ # z # This is the exact order that is stored by dirstate # All children in a directory are mentioned before any children of # children are mentioned. # So all the root-directory paths, then all of the # first subdirectory, etc.
paths = [ # content of '/' b"", b"a", b"a-a", b"a-z", b"a=a", b"a=z", # content of 'a/' b"a/a", b"a/a-a", b"a/a-z", b"a/a=a", b"a/a=z", b"a/z", b"a/z-a", b"a/z-z", b"a/z=a", b"a/z=z", # content of 'a/a/' b"a/a/a", b"a/a/z", # content of 'a/a-a' b"a/a-a/a", # content of 'a/a-z' b"a/a-z/z", # content of 'a/a=a' b"a/a=a/a", # content of 'a/a=z' b"a/a=z/z", # content of 'a/z/' b"a/z/a", b"a/z/z", # content of 'a-a' b"a-a/a", # content of 'a-z' b"a-z/z", # content of 'a=a' b"a=a/a", # content of 'a=z' b"a=z/z", ] split_paths = self.split_for_dirblocks(paths) sorted_paths = [] for dir_parts, basename in split_paths: if dir_parts == [b""]: sorted_paths.append(basename) else: sorted_paths.append(b"/".join(dir_parts + [basename])) self.assertEqual(sorted_paths, paths) for path in paths: self.assertBisect(paths, split_paths, path, exists=True) class TestBisectPathLeft(TestCase, TestBisectPathMixin): """Run all Bisect Path tests against bisect_path_left.""" def get_bisect_path(self): from ..dirstate import bisect_path_left return bisect_path_left def get_bisect(self): return bisect.bisect_left, 0 class TestBisectPathRight(TestCase, TestBisectPathMixin): """Run all Bisect Path tests against bisect_path_right.""" def get_bisect_path(self): from ..dirstate import bisect_path_right return bisect_path_right def get_bisect(self): return bisect.bisect_right, -1 class TestLtByDirs(TestCase): """Test an implementation of lt_by_dirs(). lt_by_dirs() compares 2 paths by their directory sections, rather than as plain strings. """ def assertCmpByDirs(self, expected, str1, str2): """Compare the two strings, in both directions. :param expected: The expected comparison value. 
-1 means str1 comes first, 0 means they are equal, 1 means str2 comes first :param str1: string to compare :param str2: string to compare """ if expected == 0: self.assertEqual(str1, str2) self.assertFalse(dirstate.lt_by_dirs(str1, str2)) self.assertFalse(dirstate.lt_by_dirs(str2, str1)) elif expected > 0: self.assertFalse(dirstate.lt_by_dirs(str1, str2)) self.assertTrue(dirstate.lt_by_dirs(str2, str1)) else: self.assertTrue(dirstate.lt_by_dirs(str1, str2)) self.assertFalse(dirstate.lt_by_dirs(str2, str1)) def test_cmp_empty(self): """Compare against the empty string.""" self.assertCmpByDirs(0, b"", b"") self.assertCmpByDirs(1, b"a", b"") self.assertCmpByDirs(1, b"ab", b"") self.assertCmpByDirs(1, b"abc", b"") self.assertCmpByDirs(1, b"abcd", b"") self.assertCmpByDirs(1, b"abcde", b"") self.assertCmpByDirs(1, b"abcdef", b"") self.assertCmpByDirs(1, b"abcdefg", b"") self.assertCmpByDirs(1, b"abcdefgh", b"") self.assertCmpByDirs(1, b"abcdefghi", b"") self.assertCmpByDirs(1, b"test/ing/a/path/", b"") def test_cmp_same_str(self): """Compare the same string.""" self.assertCmpByDirs(0, b"a", b"a") self.assertCmpByDirs(0, b"ab", b"ab") self.assertCmpByDirs(0, b"abc", b"abc") self.assertCmpByDirs(0, b"abcd", b"abcd") self.assertCmpByDirs(0, b"abcde", b"abcde") self.assertCmpByDirs(0, b"abcdef", b"abcdef") self.assertCmpByDirs(0, b"abcdefg", b"abcdefg") self.assertCmpByDirs(0, b"abcdefgh", b"abcdefgh") self.assertCmpByDirs(0, b"abcdefghi", b"abcdefghi") self.assertCmpByDirs(0, b"testing a long string", b"testing a long string") self.assertCmpByDirs(0, b"x" * 10000, b"x" * 10000) self.assertCmpByDirs(0, b"a/b", b"a/b") self.assertCmpByDirs(0, b"a/b/c", b"a/b/c") self.assertCmpByDirs(0, b"a/b/c/d", b"a/b/c/d") self.assertCmpByDirs(0, b"a/b/c/d/e", b"a/b/c/d/e") def test_simple_paths(self): """Compare strings that act like normal string comparison.""" self.assertCmpByDirs(-1, b"a", b"b") self.assertCmpByDirs(-1, b"aa", b"ab") self.assertCmpByDirs(-1, b"ab", b"bb") 
self.assertCmpByDirs(-1, b"aaa", b"aab") self.assertCmpByDirs(-1, b"aab", b"abb") self.assertCmpByDirs(-1, b"abb", b"bbb") self.assertCmpByDirs(-1, b"aaaa", b"aaab") self.assertCmpByDirs(-1, b"aaab", b"aabb") self.assertCmpByDirs(-1, b"aabb", b"abbb") self.assertCmpByDirs(-1, b"abbb", b"bbbb") self.assertCmpByDirs(-1, b"aaaaa", b"aaaab") self.assertCmpByDirs(-1, b"a/a", b"a/b") self.assertCmpByDirs(-1, b"a/b", b"b/b") self.assertCmpByDirs(-1, b"a/a/a", b"a/a/b") self.assertCmpByDirs(-1, b"a/a/b", b"a/b/b") self.assertCmpByDirs(-1, b"a/b/b", b"b/b/b") self.assertCmpByDirs(-1, b"a/a/a/a", b"a/a/a/b") self.assertCmpByDirs(-1, b"a/a/a/b", b"a/a/b/b") self.assertCmpByDirs(-1, b"a/a/b/b", b"a/b/b/b") self.assertCmpByDirs(-1, b"a/b/b/b", b"b/b/b/b") self.assertCmpByDirs(-1, b"a/a/a/a/a", b"a/a/a/a/b") def test_tricky_paths(self): self.assertCmpByDirs(1, b"ab/cd/ef", b"ab/cc/ef") self.assertCmpByDirs(1, b"ab/cd/ef", b"ab/c/ef") self.assertCmpByDirs(-1, b"ab/cd/ef", b"ab/cd-ef") self.assertCmpByDirs(-1, b"ab/cd", b"ab/cd-") self.assertCmpByDirs(-1, b"ab/cd", b"ab-cd") def test_cmp_non_ascii(self): self.assertCmpByDirs(-1, b"\xc2\xb5", b"\xc3\xa5") # u'\xb5', u'\xe5' self.assertCmpByDirs(-1, b"a", b"\xc3\xa5") # u'a', u'\xe5' self.assertCmpByDirs(-1, b"b", b"\xc2\xb5") # u'b', u'\xb5' self.assertCmpByDirs(-1, b"a/b", b"a/\xc3\xa5") # u'a/b', u'a/\xe5' self.assertCmpByDirs(-1, b"b/a", b"b/\xc2\xb5") # u'b/a', u'b/\xb5' class TestLtPathByDirblock(TestCase): """Test an implementation of lt_path_by_dirblock(). lt_path_by_dirblock() compares two paths using the sort order used by DirState. All paths in the same directory are sorted together. Child test cases can override ``get_lt_path_by_dirblock`` to test a specific implementation. 
""" def get_lt_path_by_dirblock(self): """Get a specific implementation of lt_path_by_dirblock.""" from ..dirstate import lt_path_by_dirblock return lt_path_by_dirblock def assertLtPathByDirblock(self, paths): """Compare all paths and make sure they evaluate to the correct order. This does N^2 comparisons. It is assumed that ``paths`` is properly sorted list. :param paths: a sorted list of paths to compare """ # First, make sure the paths being passed in are correct def _key(p): dirname, basename = os.path.split(p) return dirname.split(b"/"), basename self.assertEqual(sorted(paths, key=_key), paths) lt_path_by_dirblock = self.get_lt_path_by_dirblock() for idx1, path1 in enumerate(paths): for idx2, path2 in enumerate(paths): lt_result = lt_path_by_dirblock(path1, path2) self.assertEqual( idx1 < idx2, lt_result, "{} did not state that {!r} < {!r}, lt={}".format( lt_path_by_dirblock.__name__, path1, path2, lt_result ), ) def test_cmp_simple_paths(self): """Compare against the empty string.""" self.assertLtPathByDirblock([b"", b"a", b"ab", b"abc", b"a/b/c", b"b/d/e"]) self.assertLtPathByDirblock([b"kl", b"ab/cd", b"ab/ef", b"gh/ij"]) def test_tricky_paths(self): self.assertLtPathByDirblock( [ # Contents of '' b"", b"a", b"a-a", b"a=a", b"b", # Contents of 'a' b"a/a", b"a/a-a", b"a/a=a", b"a/b", # Contents of 'a/a' b"a/a/a", b"a/a/a-a", b"a/a/a=a", # Contents of 'a/a/a' b"a/a/a/a", b"a/a/a/b", # Contents of 'a/a/a-a', b"a/a/a-a/a", b"a/a/a-a/b", # Contents of 'a/a/a=a', b"a/a/a=a/a", b"a/a/a=a/b", # Contents of 'a/a-a' b"a/a-a/a", # Contents of 'a/a-a/a' b"a/a-a/a/a", b"a/a-a/a/b", # Contents of 'a/a=a' b"a/a=a/a", # Contents of 'a/b' b"a/b/a", b"a/b/b", # Contents of 'a-a', b"a-a/a", b"a-a/b", # Contents of 'a=a', b"a=a/a", b"a=a/b", # Contents of 'b', b"b/a", b"b/b", ] ) self.assertLtPathByDirblock( [ # content of '/' b"", b"a", b"a-a", b"a-z", b"a=a", b"a=z", # content of 'a/' b"a/a", b"a/a-a", b"a/a-z", b"a/a=a", b"a/a=z", b"a/z", b"a/z-a", b"a/z-z", b"a/z=a", 
b"a/z=z", # content of 'a/a/' b"a/a/a", b"a/a/z", # content of 'a/a-a' b"a/a-a/a", # content of 'a/a-z' b"a/a-z/z", # content of 'a/a=a' b"a/a=a/a", # content of 'a/a=z' b"a/a=z/z", # content of 'a/z/' b"a/z/a", b"a/z/z", # content of 'a-a' b"a-a/a", # content of 'a-z' b"a-z/z", # content of 'a=a' b"a=a/a", # content of 'a=z' b"a=z/z", ] ) def test_nonascii(self): self.assertLtPathByDirblock( [ # content of '/' b"", b"a", b"\xc2\xb5", b"\xc3\xa5", # content of 'a' b"a/a", b"a/\xc2\xb5", b"a/\xc3\xa5", # content of 'a/a' b"a/a/a", b"a/a/\xc2\xb5", b"a/a/\xc3\xa5", # content of 'a/\xc2\xb5' b"a/\xc2\xb5/a", b"a/\xc2\xb5/\xc2\xb5", b"a/\xc2\xb5/\xc3\xa5", # content of 'a/\xc3\xa5' b"a/\xc3\xa5/a", b"a/\xc3\xa5/\xc2\xb5", b"a/\xc3\xa5/\xc3\xa5", # content of '\xc2\xb5' b"\xc2\xb5/a", b"\xc2\xb5/\xc2\xb5", b"\xc2\xb5/\xc3\xa5", # content of '\xc3\xa5' b"\xc3\xa5/a", b"\xc3\xa5/\xc2\xb5", b"\xc3\xa5/\xc3\xa5", ] ) class TestUsingCompiledIfAvailable(TestCase): """Check that any compiled functions that are available are the default. It is possible to have typos, etc. in the import line, such that _dirstate_helpers_pyx is actually available, but the compiled functions are not being used. 
""" def test__read_dirblocks(self): if compiled_dirstate_helpers is not None: from .._dirstate_helpers_pyx import _read_dirblocks else: from .._dirstate_helpers_py import _read_dirblocks self.assertIs(_read_dirblocks, dirstate._read_dirblocks) def test_update_entry(self): if compiled_dirstate_helpers is not None: from .._dirstate_helpers_pyx import update_entry else: from ..dirstate import update_entry self.assertIs(update_entry, dirstate.update_entry) def test_process_entry(self): if compiled_dirstate_helpers is not None: from .._dirstate_helpers_pyx import ProcessEntryC self.assertIs(ProcessEntryC, dirstate._process_entry) else: from ..dirstate import ProcessEntryPython self.assertIs(ProcessEntryPython, dirstate._process_entry) bzrformats_3.4.0.orig/bzrformats/tests/test__groupcompress.py0000644000000000000000000004245515162115103021676 0ustar00# Copyright (C) 2008-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for the python and pyrex extensions of groupcompress.""" import sys from testscenarios import load_tests_apply_scenarios from .. import groupcompress from .._bzr_rs import groupcompress as _groupcompress_rs from . 
import TestCase, _try_import _compiled_groupcompress_module = _try_import("bzrformats._groupcompress_pyx") def module_scenarios(): scenarios = [ ( "line", {"make_delta": groupcompress.make_line_delta}, ), ("rabin", {"make_delta": groupcompress.make_rabin_delta}), ] return scenarios def two_way_scenarios(): scenarios = [ ("LR", {"make_delta": groupcompress.make_line_delta}), ("RR", {"make_delta": groupcompress.make_rabin_delta}), ] return scenarios load_tests = load_tests_apply_scenarios try: from bzrformats import _groupcompress_pyx as _groupcompress_cython except ImportError: _groupcompress_cython = None _text1 = b"""\ This is a bit of source text which is meant to be matched against other text """ _text2 = b"""\ This is a bit of source text which is meant to differ from against other text """ _text3 = b"""\ This is a bit of source text which is meant to be matched against other text except it also has a lot more data at the end of the file """ _first_text = b"""\ a bit of text, that does not have much in common with the next text """ _second_text = b"""\ some more bit of text, that does not have much in common with the previous text and has some extra text """ _third_text = b"""\ a bit of text, that has some in common with the previous text and has some extra text and not have much in common with the next text """ _fourth_text = b"""\ 123456789012345 same rabin hash 123456789012345 same rabin hash 123456789012345 same rabin hash 123456789012345 same rabin hash """ class TestMakeAndApplyDelta(TestCase): scenarios = module_scenarios() _gc_module = None # Set by load_tests def setUp(self): super().setUp() self.apply_delta = _groupcompress_rs.apply_delta self.apply_delta_to_source = _groupcompress_rs.apply_delta_to_source def test_make_delta_is_typesafe(self): self.make_delta(b"a string", b"another string") def _check_make_delta(string1, string2): self.assertRaises(TypeError, self.make_delta, string1, string2) _check_make_delta(b"a string", object()) 
_check_make_delta(b"a string", "not a string") _check_make_delta(object(), b"a string") _check_make_delta("not a string", b"a string") def test_make_noop_delta(self): ident_delta = self.make_delta(_text1, _text1) self.assertEqual(b"M\x90M", ident_delta) ident_delta = self.make_delta(_text2, _text2) self.assertEqual(b"N\x90N", ident_delta) ident_delta = self.make_delta(_text3, _text3) self.assertEqual(b"\x87\x01\x90\x87", ident_delta) def assertDeltaIn(self, delta1, delta2, delta): """Make sure that the delta bytes match one of the expectations.""" # In general, the python delta matcher gives different results than the # pyrex delta matcher. Both should be valid deltas, though. if delta not in (delta1, delta2): self.fail( b"Delta bytes:\n" b" %r\n" b"not in %r\n" b" or %r" % (delta, delta1, delta2) ) def test_make_delta(self): delta = self.make_delta(_text1, _text2) self.assertDeltaIn( b"N\x90/\x1fdiffer from\nagainst other text\n", b"N\x90\x1d\x1ewhich is meant to differ from\n\x91:\x13", delta, ) delta = self.make_delta(_text2, _text1) self.assertDeltaIn( b"M\x90/\x1ebe matched\nagainst other text\n", b"M\x90\x1d\x1dwhich is meant to be matched\n\x91;\x13", delta, ) delta = self.make_delta(_text3, _text1) self.assertEqual(b"M\x90M", delta) delta = self.make_delta(_text3, _text2) self.assertDeltaIn( b"N\x90/\x1fdiffer from\nagainst other text\n", b"N\x90\x1d\x1ewhich is meant to differ from\n\x91:\x13", delta, ) def test_make_delta_with_large_copies(self): # We want to have a copy that is larger than 64kB, which forces us to # issue multiple copy instructions. big_text = _text3 * 1220 delta = self.make_delta(big_text, big_text) self.assertDeltaIn( b"\xdc\x86\x0a" # Encoding the length of the uncompressed text b"\x80" # Copy 64kB, starting at byte 0 b"\x84\x01" # and another 64kB starting at 64kB b"\xb4\x02\x5c\x83", # And the bit of tail. 
None, # Both implementations should be identical delta, ) def test_apply_delta_is_typesafe(self): self.apply_delta(_text1, b"M\x90M") self.assertRaises(TypeError, self.apply_delta, object(), b"M\x90M") self.assertRaises( (ValueError, TypeError), self.apply_delta, _text1.decode("latin1"), b"M\x90M", ) self.assertRaises((ValueError, TypeError), self.apply_delta, _text1, "M\x90M") self.assertRaises(TypeError, self.apply_delta, _text1, object()) def test_apply_delta(self): target = self.apply_delta( _text1, b"N\x90/\x1fdiffer from\nagainst other text\n" ) self.assertEqual(_text2, target) target = self.apply_delta(_text2, b"M\x90/\x1ebe matched\nagainst other text\n") self.assertEqual(_text1, target) def test_apply_delta_to_source_is_safe(self): self.assertRaises(TypeError, self.apply_delta_to_source, object(), 0, 1) self.assertRaises(TypeError, self.apply_delta_to_source, "unicode str", 0, 1) # end > length self.assertRaises(ValueError, self.apply_delta_to_source, b"foo", 1, 4) # start > length self.assertRaises(ValueError, self.apply_delta_to_source, b"foo", 5, 3) # start > end self.assertRaises(ValueError, self.apply_delta_to_source, b"foo", 3, 2) def test_apply_delta_to_source(self): source_and_delta = _text1 + b"N\x90/\x1fdiffer from\nagainst other text\n" self.assertEqual( _text2, self.apply_delta_to_source( source_and_delta, len(_text1), len(source_and_delta) ), ) class TestMakeAndApplyCompatible(TestCase): scenarios = two_way_scenarios() make_delta = None # Set by load_tests apply_delta = _groupcompress_rs.apply_delta def assertMakeAndApply(self, source, target): """Assert that generating a delta and applying gives success.""" delta = self.make_delta(source, target) bytes = self.apply_delta(source, delta) self.assertEqualDiff(target, bytes) def test_direct(self): self.assertMakeAndApply(_text1, _text2) self.assertMakeAndApply(_text2, _text1) self.assertMakeAndApply(_text1, _text3) self.assertMakeAndApply(_text3, _text1) self.assertMakeAndApply(_text2, _text3) 
self.assertMakeAndApply(_text3, _text2) class TestDeltaIndex(TestCase): def setUp(self): super().setUp() # This test isn't multiplied, because we only have DeltaIndex for the # compiled form # We call this here, because _test_needs_features happens after setUp if _groupcompress_cython is None: self.skipTest("Cython _groupcompress module not available") self._gc_module = _groupcompress_cython def test_repr(self): di = self._gc_module.DeltaIndex(b"test text\n") self.assertEqual("DeltaIndex(1, 10)", repr(di)) def test_sizeof(self): di = self._gc_module.DeltaIndex() # Exact value will depend on platform but should include sources # source_info is a pointer and two longs so at least 12 bytes lower_bound = di._max_num_sources * 12 self.assertGreater(sys.getsizeof(di), lower_bound) def test__dump_no_index(self): di = self._gc_module.DeltaIndex() self.assertEqual(None, di._dump_index()) def test__dump_index_simple(self): di = self._gc_module.DeltaIndex() di.add_source(_text1, 0) self.assertFalse(di._has_index()) self.assertEqual(None, di._dump_index()) _ = di.make_delta(_text1) self.assertTrue(di._has_index()) hash_list, entry_list = di._dump_index() self.assertEqual(16, len(hash_list)) self.assertEqual(68, len(entry_list)) just_entries = [ (idx, text_offset, hash_val) for idx, (text_offset, hash_val) in enumerate(entry_list) if text_offset != 0 or hash_val != 0 ] rabin_hash = groupcompress.rabin_hash self.assertEqual( [ (8, 16, rabin_hash(_text1[1:17])), (25, 48, rabin_hash(_text1[33:49])), (34, 32, rabin_hash(_text1[17:33])), (47, 64, rabin_hash(_text1[49:65])), ], just_entries, ) # This ensures that the hash map points to the location we expect it to for entry_idx, _text_offset, hash_val in just_entries: self.assertEqual(entry_idx, hash_list[hash_val & 0xF]) def test__dump_index_two_sources(self): di = self._gc_module.DeltaIndex() di.add_source(_text1, 0) di.add_source(_text2, 2) start2 = len(_text1) + 2 self.assertTrue(di._has_index()) hash_list, entry_list = 
di._dump_index() self.assertEqual(16, len(hash_list)) self.assertEqual(68, len(entry_list)) just_entries = [ (idx, text_offset, hash_val) for idx, (text_offset, hash_val) in enumerate(entry_list) if text_offset != 0 or hash_val != 0 ] rabin_hash = groupcompress.rabin_hash self.assertEqual( [ (8, 16, rabin_hash(_text1[1:17])), (9, start2 + 16, rabin_hash(_text2[1:17])), (25, 48, rabin_hash(_text1[33:49])), (30, start2 + 64, rabin_hash(_text2[49:65])), (34, 32, rabin_hash(_text1[17:33])), (35, start2 + 32, rabin_hash(_text2[17:33])), (43, start2 + 48, rabin_hash(_text2[33:49])), (47, 64, rabin_hash(_text1[49:65])), ], just_entries, ) # Each entry should be in the appropriate hash bucket. for entry_idx, _text_offset, hash_val in just_entries: hash_idx = hash_val & 0xF self.assertTrue(hash_list[hash_idx] <= entry_idx < hash_list[hash_idx + 1]) def test_first_add_source_doesnt_index_until_make_delta(self): di = self._gc_module.DeltaIndex() self.assertFalse(di._has_index()) di.add_source(_text1, 0) self.assertFalse(di._has_index()) # However, asking to make a delta will trigger the index to be # generated, and will generate a proper delta delta = di.make_delta(_text2) self.assertTrue(di._has_index()) self.assertEqual(b"N\x90/\x1fdiffer from\nagainst other text\n", delta) def test_add_source_max_bytes_to_index(self): di = self._gc_module.DeltaIndex() di._max_bytes_to_index = 3 * 16 di.add_source(_text1, 0) # (77 bytes -1) // 3 = 25 byte stride di.add_source(_text3, 3) # (135 bytes -1) // 3 = 44 byte stride start2 = len(_text1) + 3 hash_list, entry_list = di._dump_index() self.assertEqual(16, len(hash_list)) self.assertEqual(67, len(entry_list)) just_entries = sorted( [ (text_offset, hash_val) for text_offset, hash_val in entry_list if text_offset != 0 or hash_val != 0 ] ) rabin_hash = groupcompress.rabin_hash self.assertEqual( [ (25, rabin_hash(_text1[10:26])), (50, rabin_hash(_text1[35:51])), (75, rabin_hash(_text1[60:76])), (start2 + 44, rabin_hash(_text3[29:45])), 
(start2 + 88, rabin_hash(_text3[73:89])), (start2 + 132, rabin_hash(_text3[117:133])), ], just_entries, ) def test_second_add_source_triggers_make_index(self): di = self._gc_module.DeltaIndex() self.assertFalse(di._has_index()) di.add_source(_text1, 0) self.assertFalse(di._has_index()) di.add_source(_text2, 0) self.assertTrue(di._has_index()) def test_make_delta(self): di = self._gc_module.DeltaIndex(_text1) delta = di.make_delta(_text2) self.assertEqual(b"N\x90/\x1fdiffer from\nagainst other text\n", delta) def test_delta_against_multiple_sources(self): di = self._gc_module.DeltaIndex() di.add_source(_first_text, 0) self.assertEqual(len(_first_text), di._source_offset) di.add_source(_second_text, 0) self.assertEqual(len(_first_text) + len(_second_text), di._source_offset) delta = di.make_delta(_third_text) result = _groupcompress_rs.apply_delta(_first_text + _second_text, delta) self.assertEqualDiff(_third_text, result) self.assertEqual( b'\x85\x01\x90\x14\x0chas some in \x91v6\x03and\x91d"\x91:\n', delta ) def test_delta_with_offsets(self): di = self._gc_module.DeltaIndex() di.add_source(_first_text, 5) self.assertEqual(len(_first_text) + 5, di._source_offset) di.add_source(_second_text, 10) self.assertEqual(len(_first_text) + len(_second_text) + 15, di._source_offset) delta = di.make_delta(_third_text) self.assertIsNot(None, delta) result = _groupcompress_rs.apply_delta( b"12345" + _first_text + b"1234567890" + _second_text, delta ) self.assertIsNot(None, result) self.assertEqualDiff(_third_text, result) self.assertEqual( b'\x85\x01\x91\x05\x14\x0chas some in \x91\x856\x03and\x91s"\x91?\n', delta, ) def test_delta_with_delta_bytes(self): di = self._gc_module.DeltaIndex() source = _first_text di.add_source(_first_text, 0) self.assertEqual(len(_first_text), di._source_offset) delta = di.make_delta(_second_text) self.assertEqual( b"h\tsome more\x91\x019&previous text\nand has some extra text\n", delta ) di.add_delta_source(delta, 0) source += delta 
self.assertEqual(len(_first_text) + len(delta), di._source_offset) second_delta = di.make_delta(_third_text) result = _groupcompress_rs.apply_delta(source, second_delta) self.assertEqualDiff(_third_text, result) # We should be able to match against the # 'previous text\nand has some...' that was part of the delta bytes # Note that we don't match the 'common with the', because it isn't long # enough to match in the original text, and those bytes are not present # in the delta for the second text. self.assertEqual( b"\x85\x01\x90\x14\x1chas some in common with the \x91S&\x03and\x91\x18,", second_delta, ) # Add this delta, and create a new delta for the same text. We should # find the remaining text, and only insert the short 'and' text. di.add_delta_source(second_delta, 0) source += second_delta third_delta = di.make_delta(_third_text) result = _groupcompress_rs.apply_delta(source, third_delta) self.assertEqualDiff(_third_text, result) self.assertEqual( b"\x85\x01\x90\x14\x91\x7e\x1c\x91S&\x03and\x91\x18,", third_delta ) # Now create a delta, which we know won't be able to be 'fit' into the # existing index fourth_delta = di.make_delta(_fourth_text) self.assertEqual( _fourth_text, _groupcompress_rs.apply_delta(source, fourth_delta) ) self.assertEqual( b"\x80\x01" b"\x7f123456789012345\nsame rabin hash\n" b"123456789012345\nsame rabin hash\n" b"123456789012345\nsame rabin hash\n" b"123456789012345\nsame rabin hash" b"\x01\n", fourth_delta, ) di.add_delta_source(fourth_delta, 0) source += fourth_delta # With the next delta, everything should be found fifth_delta = di.make_delta(_fourth_text) self.assertEqual( _fourth_text, _groupcompress_rs.apply_delta(source, fifth_delta) ) self.assertEqual(b"\x80\x01\x91\xa7\x7f\x01\n", fifth_delta) bzrformats_3.4.0.orig/bzrformats/tests/test_bisect_multi.py0000644000000000000000000003672015162115103021310 0ustar00# Copyright (C) 2007, 2009, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # 
it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for bisect_multi.""" from ..bisect_multi import bisect_multi_bytes from . import TestCase class TestBisectMultiBytes(TestCase): def test_lookup_no_keys_no_calls(self): calls = [] def missing_content(location_keys): calls.append(location_keys) return ((location_key, False) for location_key in location_keys) self.assertEqual([], list(bisect_multi_bytes(missing_content, 100, []))) self.assertEqual([], calls) def test_lookup_missing_key_no_content(self): """Doing a lookup in a zero-length file still does a single request. This makes sense because the bisector cannot tell how long content is and it's more flexible to only stop when the content object says 'False' for a given location, key pair.
""" calls = [] def missing_content(location_keys): calls.append(location_keys) return ((location_key, False) for location_key in location_keys) self.assertEqual( [], list(bisect_multi_bytes(missing_content, 0, ["foo", "bar"])) ) self.assertEqual([[(0, "foo"), (0, "bar")]], calls) def test_lookup_missing_key_before_all_others(self): calls = [] def missing_first_content(location_keys): # returns -1 for all keys unless the byte offset is 0 when it # returns False calls.append(location_keys) result = [] for location_key in location_keys: if location_key[0] == 0: result.append((location_key, False)) else: result.append((location_key, -1)) return result # given a 0 length file, this should terminate with one call. self.assertEqual( [], list(bisect_multi_bytes(missing_first_content, 0, ["foo", "bar"])) ) self.assertEqual([[(0, "foo"), (0, "bar")]], calls) del calls[:] # given a 2 length file, this should make two calls - 1, 0. self.assertEqual( [], list(bisect_multi_bytes(missing_first_content, 2, ["foo", "bar"])) ) self.assertEqual( [ [(1, "foo"), (1, "bar")], [(0, "foo"), (0, "bar")], ], calls, ) del calls[:] # given a really long file - 200MB, this should make a series of calls with the # gap between adjacent calls dropping by 50% each time. We choose a # length which is just under a power of two to generate a corner case in # bisection - naively using power of two reduction in size can lead to # a very long tail in the bisection process. The current users of # the bisect_multi_bytes api are not expected to be concerned by this, # as the delta gets down to 4K (the minimum we expect to read and # parse) within 16 steps even on a 200MB index (which at 4 keys/K is # 800 thousand keys, and log2 of 800000 is 19 - so we're doing log2 # steps in the worst case there).
self.assertEqual( [], list( bisect_multi_bytes(missing_first_content, 268435456 - 1, ["foo", "bar"]) ), ) self.assertEqual( [ [(134217727, "foo"), (134217727, "bar")], [(67108864, "foo"), (67108864, "bar")], [(33554433, "foo"), (33554433, "bar")], [(16777218, "foo"), (16777218, "bar")], [(8388611, "foo"), (8388611, "bar")], [(4194308, "foo"), (4194308, "bar")], [(2097157, "foo"), (2097157, "bar")], [(1048582, "foo"), (1048582, "bar")], [(524295, "foo"), (524295, "bar")], [(262152, "foo"), (262152, "bar")], [(131081, "foo"), (131081, "bar")], [(65546, "foo"), (65546, "bar")], [(32779, "foo"), (32779, "bar")], [(16396, "foo"), (16396, "bar")], [(8205, "foo"), (8205, "bar")], [(4110, "foo"), (4110, "bar")], [(2063, "foo"), (2063, "bar")], [(1040, "foo"), (1040, "bar")], [(529, "foo"), (529, "bar")], [(274, "foo"), (274, "bar")], [(147, "foo"), (147, "bar")], [(84, "foo"), (84, "bar")], [(53, "foo"), (53, "bar")], [(38, "foo"), (38, "bar")], [(31, "foo"), (31, "bar")], [(28, "foo"), (28, "bar")], [(27, "foo"), (27, "bar")], [(26, "foo"), (26, "bar")], [(25, "foo"), (25, "bar")], [(24, "foo"), (24, "bar")], [(23, "foo"), (23, "bar")], [(22, "foo"), (22, "bar")], [(21, "foo"), (21, "bar")], [(20, "foo"), (20, "bar")], [(19, "foo"), (19, "bar")], [(18, "foo"), (18, "bar")], [(17, "foo"), (17, "bar")], [(16, "foo"), (16, "bar")], [(15, "foo"), (15, "bar")], [(14, "foo"), (14, "bar")], [(13, "foo"), (13, "bar")], [(12, "foo"), (12, "bar")], [(11, "foo"), (11, "bar")], [(10, "foo"), (10, "bar")], [(9, "foo"), (9, "bar")], [(8, "foo"), (8, "bar")], [(7, "foo"), (7, "bar")], [(6, "foo"), (6, "bar")], [(5, "foo"), (5, "bar")], [(4, "foo"), (4, "bar")], [(3, "foo"), (3, "bar")], [(2, "foo"), (2, "bar")], [(1, "foo"), (1, "bar")], [(0, "foo"), (0, "bar")], ], calls, ) def test_lookup_missing_key_after_all_others(self): calls = [] end = None def missing_last_content(location_keys): # returns +1 for all keys unless the byte offset is 'end' when it # returns False 
calls.append(location_keys) result = [] for location_key in location_keys: if location_key[0] == end: result.append((location_key, False)) else: result.append((location_key, +1)) return result # given a 0 length file, this should terminate with one call. end = 0 self.assertEqual( [], list(bisect_multi_bytes(missing_last_content, 0, ["foo", "bar"])) ) self.assertEqual([[(0, "foo"), (0, "bar")]], calls) del calls[:] end = 2 # given a 3 length file, this should make two calls - 1, 2. self.assertEqual( [], list(bisect_multi_bytes(missing_last_content, 3, ["foo", "bar"])) ) self.assertEqual( [ [(1, "foo"), (1, "bar")], [(2, "foo"), (2, "bar")], ], calls, ) del calls[:] end = 268435456 - 2 # see the really-big lookup series in # test_lookup_missing_key_before_all_others for details about this # assertion. self.assertEqual( [], list( bisect_multi_bytes(missing_last_content, 268435456 - 1, ["foo", "bar"]) ), ) self.assertEqual( [ [(134217727, "foo"), (134217727, "bar")], [(201326590, "foo"), (201326590, "bar")], [(234881021, "foo"), (234881021, "bar")], [(251658236, "foo"), (251658236, "bar")], [(260046843, "foo"), (260046843, "bar")], [(264241146, "foo"), (264241146, "bar")], [(266338297, "foo"), (266338297, "bar")], [(267386872, "foo"), (267386872, "bar")], [(267911159, "foo"), (267911159, "bar")], [(268173302, "foo"), (268173302, "bar")], [(268304373, "foo"), (268304373, "bar")], [(268369908, "foo"), (268369908, "bar")], [(268402675, "foo"), (268402675, "bar")], [(268419058, "foo"), (268419058, "bar")], [(268427249, "foo"), (268427249, "bar")], [(268431344, "foo"), (268431344, "bar")], [(268433391, "foo"), (268433391, "bar")], [(268434414, "foo"), (268434414, "bar")], [(268434925, "foo"), (268434925, "bar")], [(268435180, "foo"), (268435180, "bar")], [(268435307, "foo"), (268435307, "bar")], [(268435370, "foo"), (268435370, "bar")], [(268435401, "foo"), (268435401, "bar")], [(268435416, "foo"), (268435416, "bar")], [(268435423, "foo"), (268435423, "bar")], [(268435426, 
"foo"), (268435426, "bar")], [(268435427, "foo"), (268435427, "bar")], [(268435428, "foo"), (268435428, "bar")], [(268435429, "foo"), (268435429, "bar")], [(268435430, "foo"), (268435430, "bar")], [(268435431, "foo"), (268435431, "bar")], [(268435432, "foo"), (268435432, "bar")], [(268435433, "foo"), (268435433, "bar")], [(268435434, "foo"), (268435434, "bar")], [(268435435, "foo"), (268435435, "bar")], [(268435436, "foo"), (268435436, "bar")], [(268435437, "foo"), (268435437, "bar")], [(268435438, "foo"), (268435438, "bar")], [(268435439, "foo"), (268435439, "bar")], [(268435440, "foo"), (268435440, "bar")], [(268435441, "foo"), (268435441, "bar")], [(268435442, "foo"), (268435442, "bar")], [(268435443, "foo"), (268435443, "bar")], [(268435444, "foo"), (268435444, "bar")], [(268435445, "foo"), (268435445, "bar")], [(268435446, "foo"), (268435446, "bar")], [(268435447, "foo"), (268435447, "bar")], [(268435448, "foo"), (268435448, "bar")], [(268435449, "foo"), (268435449, "bar")], [(268435450, "foo"), (268435450, "bar")], [(268435451, "foo"), (268435451, "bar")], [(268435452, "foo"), (268435452, "bar")], [(268435453, "foo"), (268435453, "bar")], [(268435454, "foo"), (268435454, "bar")], ], calls, ) def test_lookup_when_a_key_is_missing_continues(self): calls = [] def missing_foo_otherwise_missing_first_content(location_keys): # returns -1 for all keys unless the byte offset is 0 when it # returns False calls.append(location_keys) result = [] for location_key in location_keys: if location_key[1] == "foo" or location_key[0] == 0: result.append((location_key, False)) else: result.append((location_key, -1)) return result # given a 2 length file, this should terminate with two calls, one for # both keys, and one for bar only. 
self.assertEqual( [], list( bisect_multi_bytes( missing_foo_otherwise_missing_first_content, 2, ["foo", "bar"] ) ), ) self.assertEqual( [ [(1, "foo"), (1, "bar")], [(0, "bar")], ], calls, ) def test_found_keys_returned_other_searches_continue(self): calls = [] def find_bar_at_1_foo_missing_at_0(location_keys): calls.append(location_keys) result = [] for location_key in location_keys: if location_key == (1, "bar"): result.append((location_key, "bar-result")) elif location_key[0] == 0: result.append((location_key, False)) else: result.append((location_key, -1)) return result # given a 4 length file, this should terminate with three calls, two for # both keys, and one for foo only. self.assertEqual( [("bar", "bar-result")], list(bisect_multi_bytes(find_bar_at_1_foo_missing_at_0, 4, ["foo", "bar"])), ) self.assertEqual( [ [(2, "foo"), (2, "bar")], [(1, "foo"), (1, "bar")], [(0, "foo")], ], calls, ) def test_searches_different_keys_in_different_directions(self): calls = [] def missing_bar_at_1_foo_at_3(location_keys): calls.append(location_keys) result = [] for location_key in location_keys: if location_key[1] == "bar": if location_key[0] == 1: result.append((location_key, False)) else: # search down result.append((location_key, -1)) elif location_key[1] == "foo": if location_key[0] == 3: result.append((location_key, False)) else: # search up result.append((location_key, +1)) return result # given a 4 length file, this should terminate with two calls. 
self.assertEqual( [], list(bisect_multi_bytes(missing_bar_at_1_foo_at_3, 4, ["foo", "bar"])) ) self.assertEqual( [ [(2, "foo"), (2, "bar")], [(3, "foo"), (1, "bar")], ], calls, ) def test_change_direction_in_single_key_search(self): # check that we can search down, up, down again - # so length 8, goes 4, 6, 5 calls = [] def missing_at_5(location_keys): calls.append(location_keys) result = [] for location_key in location_keys: if location_key[0] == 5: result.append((location_key, False)) elif location_key[0] > 5: # search down result.append((location_key, -1)) else: # search up result.append((location_key, +1)) return result # given a 8 length file, this should terminate with three calls. self.assertEqual([], list(bisect_multi_bytes(missing_at_5, 8, ["foo", "bar"]))) self.assertEqual( [ [(4, "foo"), (4, "bar")], [(6, "foo"), (6, "bar")], [(5, "foo"), (5, "bar")], ], calls, ) bzrformats_3.4.0.orig/bzrformats/tests/test_btree_index.py0000644000000000000000000022130315162115103021106 0ustar00# Copyright (C) 2008-2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for btree indices.""" import pprint import time import zlib from testscenarios import load_tests_apply_scenarios from .. import btree_index, lru_cache, osutils from .. 
import index as _mod_index from ..lru_cache import FIFOCache from ..transport import MemoryTransport, TracingTransport from . import TestCase, TestCaseWithMemoryTransport, _try_import load_tests = load_tests_apply_scenarios _compiled_btreeparser_module = _try_import("bzrformats._btree_serializer_pyx") def btreeparser_scenarios(): import bzrformats._btree_serializer_py as py_module scenarios = [("python", {"parse_btree": py_module})] if _compiled_btreeparser_module is not None: scenarios.append(("C", {"parse_btree": _compiled_btreeparser_module})) return scenarios class BTreeTestCase(TestCaseWithMemoryTransport): # test names here are suffixed by the key length and reference list count # that they test. def setUp(self): super().setUp() self.overrideAttr(btree_index, "_RESERVED_HEADER_BYTES", 100) def make_nodes(self, count, key_elements, reference_lists): """Generate count*key_elements sample nodes.""" def _pos_to_key(pos, lead=b""): return (lead + (b"%d" % pos) * 40,) keys = [] for prefix_pos in range(key_elements): prefix = _pos_to_key(prefix_pos) if key_elements - 1 else () for pos in range(count): # TODO: This creates odd keys. When count == 100,000, it # creates a 240 byte key key = prefix + _pos_to_key(pos) value = b"value:%d" % pos if reference_lists: # generate some references refs = [] for list_pos in range(reference_lists): # as many keys in each list as its index + the key depth # mod 2 - this generates both 0 length lists and # ones slightly longer than the number of lists. # It also ensures we have non homogeneous lists. 
refs.append([]) for ref_pos in range(list_pos + pos % 2): if pos % 2: # refer to a nearby key refs[-1].append(prefix + _pos_to_key(pos - 1, b"ref")) else: # serial of this ref in the ref list refs[-1].append(prefix + _pos_to_key(ref_pos, b"ref")) refs[-1] = tuple(refs[-1]) refs = tuple(refs) else: refs = () keys.append((key, value, refs)) return keys def shrink_page_size(self): """Shrink the default page size so that less fits in a page.""" self.overrideAttr(btree_index, "_PAGE_SIZE", 2048) def assertEqualApproxCompressed(self, expected, actual, slop=6): """Check a count of compressed bytes is approximately as expected. Relying on compressed length being stable even with fixed inputs is slightly bogus, but zlib is stable enough that this mostly works. """ if not expected - slop < actual < expected + slop: self.fail( "Expected around %d bytes compressed but got %d" % (expected, actual) ) class TestBTreeBuilder(BTreeTestCase): def test_clear_cache(self): builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1) # This is a no-op, but we need the api to be consistent with other # BTreeGraphIndex apis. builder.clear_cache() def test_empty_1_0(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) # NamedTemporaryFile dies on builder.finish().read(). weird. temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=0\nkey_elements=1\nlen=0\n" b"row_lengths=\n", content, ) def test_empty_2_1(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=1) # NamedTemporaryFile dies on builder.finish().read(). weird. 
temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=1\nkey_elements=2\nlen=0\n" b"row_lengths=\n", content, ) def test_root_leaf_1_0(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) nodes = self.make_nodes(5, 1, 0) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird. temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqual(131, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=0\nkey_elements=1\nlen=5\n" b"row_lengths=1\n", content[:73], ) node_content = content[73:] node_bytes = zlib.decompress(node_content) expected_node = ( b"type=leaf\n" b"0000000000000000000000000000000000000000\x00\x00value:0\n" b"1111111111111111111111111111111111111111\x00\x00value:1\n" b"2222222222222222222222222222222222222222\x00\x00value:2\n" b"3333333333333333333333333333333333333333\x00\x00value:3\n" b"4444444444444444444444444444444444444444\x00\x00value:4\n" ) self.assertEqual(expected_node, node_bytes) def test_root_leaf_2_2(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(5, 2, 2) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird. 
temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqual(238, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=2\nkey_elements=2\nlen=10\n" b"row_lengths=1\n", content[:74], ) node_content = content[74:] node_bytes = zlib.decompress(node_content) expected_node = ( b"type=leaf\n" b"0000000000000000000000000000000000000000\x000000000000000000000000000000000000000000\x00\t0000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\x00value:0\n" b"0000000000000000000000000000000000000000\x001111111111111111111111111111111111111111\x000000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\t0000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\r0000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\x00value:1\n" b"0000000000000000000000000000000000000000\x002222222222222222222222222222222222222222\x00\t0000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\x00value:2\n" b"0000000000000000000000000000000000000000\x003333333333333333333333333333333333333333\x000000000000000000000000000000000000000000\x00ref2222222222222222222222222222222222222222\t0000000000000000000000000000000000000000\x00ref2222222222222222222222222222222222222222\r0000000000000000000000000000000000000000\x00ref2222222222222222222222222222222222222222\x00value:3\n" b"0000000000000000000000000000000000000000\x004444444444444444444444444444444444444444\x00\t0000000000000000000000000000000000000000\x00ref0000000000000000000000000000000000000000\x00value:4\n" b"1111111111111111111111111111111111111111\x000000000000000000000000000000000000000000\x00\t1111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\x00value:0\n" 
b"1111111111111111111111111111111111111111\x001111111111111111111111111111111111111111\x001111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\t1111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\r1111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\x00value:1\n" b"1111111111111111111111111111111111111111\x002222222222222222222222222222222222222222\x00\t1111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\x00value:2\n" b"1111111111111111111111111111111111111111\x003333333333333333333333333333333333333333\x001111111111111111111111111111111111111111\x00ref2222222222222222222222222222222222222222\t1111111111111111111111111111111111111111\x00ref2222222222222222222222222222222222222222\r1111111111111111111111111111111111111111\x00ref2222222222222222222222222222222222222222\x00value:3\n" b"1111111111111111111111111111111111111111\x004444444444444444444444444444444444444444\x00\t1111111111111111111111111111111111111111\x00ref0000000000000000000000000000000000000000\x00value:4\n" b"" ) self.assertEqual(expected_node, node_bytes) def test_2_leaves_1_0(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) nodes = self.make_nodes(400, 1, 0) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird. 
temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqualApproxCompressed(9283, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=0\nkey_elements=1\nlen=400\n" b"row_lengths=1,2\n", content[:77], ) root = content[77:4096] leaf1 = content[4096:8192] leaf2 = content[8192:] root_bytes = zlib.decompress(root) expected_root = (b"type=internal\noffset=0\n") + (b"307" * 40) + b"\n" self.assertEqual(expected_root, root_bytes) # We already know serialisation works for leaves, check key selection: leaf1_bytes = zlib.decompress(leaf1) sorted_node_keys = sorted(node[0] for node in nodes) node = btree_index._LeafNode(leaf1_bytes, 1, 0) self.assertEqual(231, len(node)) self.assertEqual(sorted_node_keys[:231], node.all_keys()) leaf2_bytes = zlib.decompress(leaf2) node = btree_index._LeafNode(leaf2_bytes, 1, 0) self.assertEqual(400 - 231, len(node)) self.assertEqual(sorted_node_keys[231:], node.all_keys()) def test_last_page_rounded_1_layer(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) nodes = self.make_nodes(10, 1, 0) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird. temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqualApproxCompressed(155, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=0\nkey_elements=1\nlen=10\n" b"row_lengths=1\n", content[:74], ) # Check the last page is well formed leaf2 = content[74:] leaf2_bytes = zlib.decompress(leaf2) node = btree_index._LeafNode(leaf2_bytes, 1, 0) self.assertEqual(10, len(node)) sorted_node_keys = sorted(node[0] for node in nodes) self.assertEqual(sorted_node_keys, node.all_keys()) def test_last_page_not_rounded_2_layer(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) nodes = self.make_nodes(400, 1, 0) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird.
temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqualApproxCompressed(9283, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=0\nkey_elements=1\nlen=400\n" b"row_lengths=1,2\n", content[:77], ) # Check the last page is well formed leaf2 = content[8192:] leaf2_bytes = zlib.decompress(leaf2) node = btree_index._LeafNode(leaf2_bytes, 1, 0) self.assertEqual(400 - 231, len(node)) sorted_node_keys = sorted(node[0] for node in nodes) self.assertEqual(sorted_node_keys[231:], node.all_keys()) def test_three_level_tree_details(self): # The left most pointer in the second internal node in a row should # point to the second node that the internal node is for, _not_ # the first, otherwise the first node overlaps with the last node of # the prior internal node on that row. self.shrink_page_size() builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) # 40K nodes is enough to create two internal nodes on the second # level, with a 2K page size nodes = self.make_nodes(20000, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", self.time(builder.finish)) del builder index = btree_index.BTreeGraphIndex(t, "index", size) # Seed the metadata, we're using internal calls now. index.key_count() self.assertEqual( 3, len(index._row_lengths), f"Not enough rows: {index._row_lengths!r}" ) self.assertEqual(4, len(index._row_offsets)) self.assertEqual(sum(index._row_lengths), index._row_offsets[-1]) internal_nodes = index._get_internal_nodes([0, 1, 2]) internal_nodes[0] internal_node1 = internal_nodes[1] internal_node2 = internal_nodes[2] # The left most node node2 points at should be one after the right most # node pointed at by node1. self.assertEqual(internal_node2.offset, 1 + len(internal_node1.keys)) # The left most key of the second node pointed at by internal_node2 # should be its first key.
We can check this by looking for its first key # in the second node it points at pos = index._row_offsets[2] + internal_node2.offset + 1 leaf = index._get_leaf_nodes([pos])[pos] self.assertIn(internal_node2.keys[0], leaf) def test_2_leaves_2_2(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(100, 2, 2) for node in nodes: builder.add_node(*node) # NamedTemporaryFile dies on builder.finish().read(). weird. temp_file = builder.finish() content = temp_file.read() del temp_file self.assertEqualApproxCompressed(12643, len(content)) self.assertEqual( b"B+Tree Graph Index 2\nnode_ref_lists=2\nkey_elements=2\nlen=200\n" b"row_lengths=1,3\n", content[:77], ) root = content[77:4096] content[4096:8192] content[8192:12288] content[12288:] root_bytes = zlib.decompress(root) expected_root = ( b"type=internal\n" b"offset=0\n" + (b"0" * 40) + b"\x00" + (b"91" * 40) + b"\n" + (b"1" * 40) + b"\x00" + (b"81" * 40) + b"\n" ) self.assertEqual(expected_root, root_bytes) # We assume the other leaf nodes have been written correctly - layering # FTW. def test_spill_index_stress_1_1(self): builder = btree_index.BTreeBuilder(key_elements=1, spill_at=2) nodes = [node[0:2] for node in self.make_nodes(16, 1, 0)] builder.add_node(*nodes[0]) # Test the parts of the index that take up memory are doing so # predictably. 
self.assertEqual(1, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) builder.add_node(*nodes[1]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(1, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) # now back to memory builder.add_node(*nodes[2]) self.assertEqual(1, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) # And spills to a second backing index combining all builder.add_node(*nodes[3]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(2, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(4, builder._backing_indices[1].key_count()) # The next spills to the 2-len slot builder.add_node(*nodes[4]) builder.add_node(*nodes[5]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(2, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) self.assertEqual(4, builder._backing_indices[1].key_count()) # Next spill combines builder.add_node(*nodes[6]) builder.add_node(*nodes[7]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(None, builder._backing_indices[1]) self.assertEqual(8, builder._backing_indices[2].key_count()) # And so forth - counting up in binary.
builder.add_node(*nodes[8]) builder.add_node(*nodes[9]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) self.assertEqual(None, builder._backing_indices[1]) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[10]) builder.add_node(*nodes[11]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(4, builder._backing_indices[1].key_count()) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[12]) # Test that memory and disk are both used for query methods; and that # None is skipped over happily. self.assertEqual( [(builder,) + node for node in sorted(nodes[:13])], list(builder.iter_all_entries()), ) # Two nodes - one memory one disk self.assertEqual( {(builder,) + node for node in nodes[11:13]}, set(builder.iter_entries([nodes[12][0], nodes[11][0]])), ) self.assertEqual(13, builder.key_count()) self.assertEqual( {(builder,) + node for node in nodes[11:13]}, set(builder.iter_entries_prefix([nodes[12][0], nodes[11][0]])), ) builder.add_node(*nodes[13]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) self.assertEqual(4, builder._backing_indices[1].key_count()) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[14]) builder.add_node(*nodes[15]) self.assertEqual(4, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(None, builder._backing_indices[1]) self.assertEqual(None, builder._backing_indices[2]) self.assertEqual(16, builder._backing_indices[3].key_count()) # Now finish, and check we got a correctly ordered tree t = self.get_transport("") size = t.put_file("index", builder.finish()) index = btree_index.BTreeGraphIndex(t, "index", size) nodes = list(index.iter_all_entries()) self.assertEqual(sorted(nodes), nodes) 
self.assertEqual(16, len(nodes)) def test_spill_index_stress_1_1_no_combine(self): builder = btree_index.BTreeBuilder(key_elements=1, spill_at=2) builder.set_optimize(for_size=False, combine_backing_indices=False) nodes = [node[0:2] for node in self.make_nodes(16, 1, 0)] builder.add_node(*nodes[0]) # Test the parts of the index that take up memory are doing so # predictably. self.assertEqual(1, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) builder.add_node(*nodes[1]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(1, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) # now back to memory builder.add_node(*nodes[2]) self.assertEqual(1, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) # And spills to a second backing index but doesn't combine builder.add_node(*nodes[3]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(2, len(builder._backing_indices)) for backing_index in builder._backing_indices: self.assertEqual(2, backing_index.key_count()) # The next spills to the 3rd slot builder.add_node(*nodes[4]) builder.add_node(*nodes[5]) self.assertEqual(0, len(builder._nodes)) self.assertIs(None, builder._nodes_by_key) self.assertEqual(3, len(builder._backing_indices)) for backing_index in builder._backing_indices: self.assertEqual(2, backing_index.key_count()) # Now spill a few more, and check that we don't combine builder.add_node(*nodes[6]) builder.add_node(*nodes[7]) builder.add_node(*nodes[8]) builder.add_node(*nodes[9]) builder.add_node(*nodes[10]) builder.add_node(*nodes[11]) builder.add_node(*nodes[12]) self.assertEqual(6, len(builder._backing_indices)) for backing_index in builder._backing_indices: self.assertEqual(2, backing_index.key_count()) # Test that memory and disk are both used for query methods; and that # None is skipped over happily. 
        self.assertEqual(
            [(builder,) + node for node in sorted(nodes[:13])],
            list(builder.iter_all_entries()),
        )
        # Two nodes - one memory one disk
        self.assertEqual(
            {(builder,) + node for node in nodes[11:13]},
            set(builder.iter_entries([nodes[12][0], nodes[11][0]])),
        )
        self.assertEqual(13, builder.key_count())
        self.assertEqual(
            {(builder,) + node for node in nodes[11:13]},
            set(builder.iter_entries_prefix([nodes[12][0], nodes[11][0]])),
        )
        builder.add_node(*nodes[13])
        builder.add_node(*nodes[14])
        builder.add_node(*nodes[15])
        self.assertEqual(8, len(builder._backing_indices))
        for backing_index in builder._backing_indices:
            self.assertEqual(2, backing_index.key_count())
        # Now finish, and check we got a correctly ordered tree
        transport = self.get_transport("")
        size = transport.put_file("index", builder.finish())
        index = btree_index.BTreeGraphIndex(transport, "index", size)
        nodes = list(index.iter_all_entries())
        self.assertEqual(sorted(nodes), nodes)
        self.assertEqual(16, len(nodes))

    def test_set_optimize(self):
        builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2)
        builder.set_optimize(for_size=True)
        self.assertTrue(builder._optimize_for_size)
        builder.set_optimize(for_size=False)
        self.assertFalse(builder._optimize_for_size)
        # test that we can set combine_backing_indices without affecting
        # _optimize_for_size
        obj = object()
        builder._optimize_for_size = obj
        builder.set_optimize(combine_backing_indices=False)
        self.assertFalse(builder._combine_backing_indices)
        self.assertIs(obj, builder._optimize_for_size)
        builder.set_optimize(combine_backing_indices=True)
        self.assertTrue(builder._combine_backing_indices)
        self.assertIs(obj, builder._optimize_for_size)

    def test_spill_index_stress_2_2(self):
        # test that references and longer keys don't confuse things.
        builder = btree_index.BTreeBuilder(
            key_elements=2, reference_lists=2, spill_at=2
        )
        nodes = self.make_nodes(16, 2, 2)
        builder.add_node(*nodes[0])
        # Test the parts of the index that take up memory are doing so
        # predictably.
        self.assertEqual(1, len(builder._nodes))
        self.assertIs(None, builder._nodes_by_key)
        builder.add_node(*nodes[1])
        self.assertEqual(0, len(builder._nodes))
        self.assertIs(None, builder._nodes_by_key)
        self.assertEqual(1, len(builder._backing_indices))
        self.assertEqual(2, builder._backing_indices[0].key_count())
        # now back to memory
        # Build up the nodes by key dict
        old = dict(builder._get_nodes_by_key())
        builder.add_node(*nodes[2])
        self.assertEqual(1, len(builder._nodes))
        self.assertIsNot(None, builder._nodes_by_key)
        self.assertNotEqual({}, builder._nodes_by_key)
        # We should have a new entry
        self.assertNotEqual(old, builder._nodes_by_key)
        # And spills to a second backing index combining all
        builder.add_node(*nodes[3])
        self.assertEqual(0, len(builder._nodes))
        self.assertIs(None, builder._nodes_by_key)
        self.assertEqual(2, len(builder._backing_indices))
        self.assertEqual(None, builder._backing_indices[0])
        self.assertEqual(4, builder._backing_indices[1].key_count())
        # The next spills to the 2-len slot
        builder.add_node(*nodes[4])
        builder.add_node(*nodes[5])
        self.assertEqual(0, len(builder._nodes))
        self.assertIs(None, builder._nodes_by_key)
        self.assertEqual(2, len(builder._backing_indices))
        self.assertEqual(2, builder._backing_indices[0].key_count())
        self.assertEqual(4, builder._backing_indices[1].key_count())
        # Next spill combines
        builder.add_node(*nodes[6])
        builder.add_node(*nodes[7])
        self.assertEqual(3, len(builder._backing_indices))
        self.assertEqual(None, builder._backing_indices[0])
        self.assertEqual(None, builder._backing_indices[1])
        self.assertEqual(8, builder._backing_indices[2].key_count())
        # And so forth - counting up in binary.
builder.add_node(*nodes[8]) builder.add_node(*nodes[9]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) self.assertEqual(None, builder._backing_indices[1]) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[10]) builder.add_node(*nodes[11]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(4, builder._backing_indices[1].key_count()) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[12]) # Test that memory and disk are both used for query methods; and that # None is skipped over happily. self.assertEqual( [(builder,) + node for node in sorted(nodes[:13])], list(builder.iter_all_entries()), ) # Two nodes - one memory one disk self.assertEqual( {(builder,) + node for node in nodes[11:13]}, set(builder.iter_entries([nodes[12][0], nodes[11][0]])), ) self.assertEqual(13, builder.key_count()) self.assertEqual( {(builder,) + node for node in nodes[11:13]}, set(builder.iter_entries_prefix([nodes[12][0], nodes[11][0]])), ) builder.add_node(*nodes[13]) self.assertEqual(3, len(builder._backing_indices)) self.assertEqual(2, builder._backing_indices[0].key_count()) self.assertEqual(4, builder._backing_indices[1].key_count()) self.assertEqual(8, builder._backing_indices[2].key_count()) builder.add_node(*nodes[14]) builder.add_node(*nodes[15]) self.assertEqual(4, len(builder._backing_indices)) self.assertEqual(None, builder._backing_indices[0]) self.assertEqual(None, builder._backing_indices[1]) self.assertEqual(None, builder._backing_indices[2]) self.assertEqual(16, builder._backing_indices[3].key_count()) # Now finish, and check we got a correctly ordered tree transport = self.get_transport("") size = transport.put_file("index", builder.finish()) index = btree_index.BTreeGraphIndex(transport, "index", size) nodes = list(index.iter_all_entries()) self.assertEqual(sorted(nodes), 
nodes) self.assertEqual(16, len(nodes)) def test_spill_index_duplicate_key_caught_on_finish(self): builder = btree_index.BTreeBuilder(key_elements=1, spill_at=2) nodes = [node[0:2] for node in self.make_nodes(16, 1, 0)] builder.add_node(*nodes[0]) builder.add_node(*nodes[1]) builder.add_node(*nodes[0]) self.assertRaises(_mod_index.BadIndexDuplicateKey, builder.finish) class TestBTreeIndex(BTreeTestCase): def make_index(self, ref_lists=0, key_elements=1, nodes=None): if nodes is None: nodes = [] builder = btree_index.BTreeBuilder( reference_lists=ref_lists, key_elements=key_elements ) for key, value, references in nodes: builder.add_node(key, value, references) stream = builder.finish() trans = TracingTransport(self.get_transport()) size = trans.put_file("index", stream) return btree_index.BTreeGraphIndex(trans, "index", size) def make_index_with_offset(self, ref_lists=1, key_elements=1, nodes=None, offset=0): if nodes is None: nodes = [] builder = btree_index.BTreeBuilder( key_elements=key_elements, reference_lists=ref_lists ) builder.add_nodes(nodes) transport = self.get_transport("") # NamedTemporaryFile dies on builder.finish().read(). weird. 
        temp_file = builder.finish()
        content = temp_file.read()
        del temp_file
        size = len(content)
        transport.put_bytes("index", (b" " * offset) + content)
        return btree_index.BTreeGraphIndex(transport, "index", size=size, offset=offset)

    def test_clear_cache(self):
        nodes = self.make_nodes(160, 2, 2)
        index = self.make_index(ref_lists=2, key_elements=2, nodes=nodes)
        self.assertEqual(1, len(list(index.iter_entries([nodes[30][0]]))))
        self.assertEqual([1, 4], index._row_lengths)
        self.assertIsNot(None, index._root_node)
        internal_node_pre_clear = set(index._internal_node_cache)
        self.assertGreater(len(index._leaf_node_cache), 0)
        index.clear_cache()
        # We don't touch _root_node or _internal_node_cache, both should be
        # small, and can save a round trip or two
        self.assertIsNot(None, index._root_node)
        # NOTE: We don't want to affect the _internal_node_cache, as we expect
        #       it will be small, and if we ever do touch this index again, it
        #       will save round-trips.  This assertion isn't very strong,
        #       because without a 3-level index, we don't have any internal
        #       nodes cached.
        self.assertEqual(internal_node_pre_clear, set(index._internal_node_cache))
        self.assertEqual(0, len(index._leaf_node_cache))

    def test_trivial_constructor(self):
        t = TracingTransport(self.get_transport(""))
        btree_index.BTreeGraphIndex(t, "index", None)
        # Checks the page size at load, but that isn't logged yet.
        self.assertEqual([], t._activity)

    def test_with_size_constructor(self):
        t = TracingTransport(self.get_transport(""))
        btree_index.BTreeGraphIndex(t, "index", 1)
        # Checks the page size at load, but that isn't logged yet.
self.assertEqual([], t._activity) def test_empty_key_count_no_size(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) t = TracingTransport(self.get_transport("")) t.put_file("index", builder.finish()) index = btree_index.BTreeGraphIndex(t, "index", None) del t._activity[:] self.assertEqual([], t._activity) self.assertEqual(0, index.key_count()) # The entire index should have been requested (as we generally have the # size available, and doing many small readvs is inappropriate). # We can't tell how much was actually read here, but - check the code. self.assertEqual([("get", "index")], t._activity) def test_empty_key_count(self): builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) self.assertEqual(72, size) index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] self.assertEqual([], t._activity) self.assertEqual(0, index.key_count()) # The entire index should have been read, as 4K > size self.assertEqual([("readv", "index", [(0, 72)], False, None)], t._activity) def test_non_empty_key_count_2_2(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(35, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] self.assertEqual([], t._activity) self.assertEqual(70, index.key_count()) # The entire index should have been read, as it is one page long. 
self.assertEqual([("readv", "index", [(0, size)], False, None)], t._activity) self.assertEqualApproxCompressed(1173, size) def test_with_offset_no_size(self): index = self.make_index_with_offset( key_elements=1, ref_lists=1, offset=1234, nodes=self.make_nodes(200, 1, 1) ) index._size = None # throw away the size info self.assertEqual(200, index.key_count()) def test_with_small_offset(self): index = self.make_index_with_offset( key_elements=1, ref_lists=1, offset=1234, nodes=self.make_nodes(200, 1, 1) ) self.assertEqual(200, index.key_count()) def test_with_large_offset(self): index = self.make_index_with_offset( key_elements=1, ref_lists=1, offset=123456, nodes=self.make_nodes(200, 1, 1) ) self.assertEqual(200, index.key_count()) def test__read_nodes_no_size_one_page_reads_once(self): self.make_index(nodes=[((b"key",), b"value", ())]) trans = TracingTransport(self.get_transport()) index = btree_index.BTreeGraphIndex(trans, "index", None) del trans._activity[:] nodes = dict(index._read_nodes([0])) self.assertEqual({0}, set(nodes)) node = nodes[0] self.assertEqual([(b"key",)], node.all_keys()) self.assertEqual([("get", "index")], trans._activity) def test__read_nodes_no_size_multiple_pages(self): index = self.make_index(2, 2, nodes=self.make_nodes(160, 2, 2)) index.key_count() num_pages = index._row_offsets[-1] # Reopen with a traced transport and no size trans = TracingTransport(self.get_transport()) index = btree_index.BTreeGraphIndex(trans, "index", None) del trans._activity[:] nodes = dict(index._read_nodes([0])) self.assertEqual(list(range(num_pages)), sorted(nodes)) def test_2_levels_key_count_2_2(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(160, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) self.assertEqualApproxCompressed(17692, size) index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] 
self.assertEqual([], t._activity) self.assertEqual(320, index.key_count()) # The entire index should not have been read. self.assertEqual([("readv", "index", [(0, 4096)], False, None)], t._activity) def test_validate_one_page(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(45, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] self.assertEqual([], t._activity) index.validate() # The entire index should have been read linearly. self.assertEqual([("readv", "index", [(0, size)], False, None)], t._activity) self.assertEqualApproxCompressed(1488, size) def test_validate_two_pages(self): builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) nodes = self.make_nodes(80, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) # Root page, 2 leaf pages self.assertEqualApproxCompressed(9339, size) index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] self.assertEqual([], t._activity) index.validate() rem = size - 8192 # Number of remaining bytes after second block # The entire index should have been read linearly. self.assertEqual( [ ("readv", "index", [(0, 4096)], False, None), ("readv", "index", [(4096, 4096), (8192, rem)], False, None), ], t._activity, ) # XXX: TODO: write some badly-ordered nodes, and some pointers-to-wrong # node and make validate find them. 
def test_eq_ne(self): # two indices are equal when constructed with the same parameters: t1 = TracingTransport(self.get_transport("")) t2 = self.get_transport() self.assertEqual( btree_index.BTreeGraphIndex(t1, "index", None), btree_index.BTreeGraphIndex(t1, "index", None), ) self.assertEqual( btree_index.BTreeGraphIndex(t1, "index", 20), btree_index.BTreeGraphIndex(t1, "index", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "index", 20), btree_index.BTreeGraphIndex(t2, "index", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "inde1", 20), btree_index.BTreeGraphIndex(t1, "inde2", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "index", 10), btree_index.BTreeGraphIndex(t1, "index", 20), ) self.assertEqual( btree_index.BTreeGraphIndex(t1, "index", None), btree_index.BTreeGraphIndex(t1, "index", None), ) self.assertEqual( btree_index.BTreeGraphIndex(t1, "index", 20), btree_index.BTreeGraphIndex(t1, "index", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "index", 20), btree_index.BTreeGraphIndex(t2, "index", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "inde1", 20), btree_index.BTreeGraphIndex(t1, "inde2", 20), ) self.assertNotEqual( btree_index.BTreeGraphIndex(t1, "index", 10), btree_index.BTreeGraphIndex(t1, "index", 20), ) def test_key_too_big(self): # the size that matters here is the _compressed_ size of the key, so we can't # do a simple character repeat. 
        bigKey = b"".join(b"%d" % n for n in range(btree_index._PAGE_SIZE))
        self.assertRaises(
            _mod_index.BadIndexKey, self.make_index, nodes=[((bigKey,), b"value", ())]
        )

    def test_iter_all_only_root_no_size(self):
        self.make_index(nodes=[((b"key",), b"value", ())])
        t = TracingTransport(self.get_transport(""))
        index = btree_index.BTreeGraphIndex(t, "index", None)
        del t._activity[:]
        self.assertEqual(
            [((b"key",), b"value")], [x[1:] for x in index.iter_all_entries()]
        )
        self.assertEqual([("get", "index")], t._activity)

    def test_iter_all_entries_reads(self):
        # iterating all entries reads the header, then does a linear
        # read.
        self.shrink_page_size()
        builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2)
        # 20k nodes is enough to create two internal nodes on the second
        # level, with a 2K page size
        nodes = self.make_nodes(10000, 2, 2)
        for node in nodes:
            builder.add_node(*node)
        t = TracingTransport(self.get_transport(""))
        size = t.put_file("index", builder.finish())
        page_size = btree_index._PAGE_SIZE
        del builder
        index = btree_index.BTreeGraphIndex(t, "index", size)
        del t._activity[:]
        self.assertEqual([], t._activity)
        found_nodes = self.time(list, index.iter_all_entries())
        bare_nodes = []
        for node in found_nodes:
            self.assertIs(node[0], index)
            bare_nodes.append(node[1:])
        self.assertEqual(
            3, len(index._row_lengths), f"Not enough rows: {index._row_lengths!r}"
        )
        # Should be as long as the nodes we supplied
        self.assertEqual(20000, len(found_nodes))
        # Should have the same content
        self.assertEqual(set(nodes), set(bare_nodes))
        # Should have done linear scan IO up the index, ignoring
        # the internal nodes:
        # The entire index should have been read
        total_pages = sum(index._row_lengths)
        self.assertEqual(total_pages, index._row_offsets[-1])
        self.assertEqualApproxCompressed(1303220, size)
        # The start of the leaves
        first_byte = index._row_offsets[-2] * page_size
        readv_request = []
        for offset in range(first_byte, size, page_size):
            readv_request.append((offset, page_size))
        # The
last page is truncated readv_request[-1] = (readv_request[-1][0], size % page_size) expected = [ ("readv", "index", [(0, page_size)], False, None), ("readv", "index", readv_request, False, None), ] if expected != t._activity: self.assertEqualDiff(pprint.pformat(expected), pprint.pformat(t._activity)) def test_iter_entries_references_2_refs_resolved(self): # iterating some entries reads just the pages needed. For now, to # get it working and start measuring, only 4K pages are read. builder = btree_index.BTreeBuilder(key_elements=2, reference_lists=2) # 80 nodes is enough to create a two-level index. nodes = self.make_nodes(160, 2, 2) for node in nodes: builder.add_node(*node) t = TracingTransport(self.get_transport("")) size = t.put_file("index", builder.finish()) del builder index = btree_index.BTreeGraphIndex(t, "index", size) del t._activity[:] self.assertEqual([], t._activity) # search for one key found_nodes = list(index.iter_entries([nodes[30][0]])) bare_nodes = [] for node in found_nodes: self.assertIs(node[0], index) bare_nodes.append(node[1:]) # Should be as long as the nodes we supplied self.assertEqual(1, len(found_nodes)) # Should have the same content self.assertEqual(nodes[30], bare_nodes[0]) # Should have read the root node, then one leaf page: self.assertEqual( [ ("readv", "index", [(0, 4096)], False, None), ( "readv", "index", [ (8192, 4096), ], False, None, ), ], t._activity, ) def test_iter_key_prefix_1_element_key_None(self): index = self.make_index() self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(None,)]) ) def test_iter_key_prefix_wrong_length(self): index = self.make_index() self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo", None)]) ) index = self.make_index(key_elements=2) self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo",)]) ) self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo", None, None)]), ) def 
test_iter_key_prefix_1_key_element_no_refs(self): index = self.make_index( nodes=[((b"name",), b"data", ()), ((b"ref",), b"refdata", ())] ) self.assertEqual( {(index, (b"name",), b"data"), (index, (b"ref",), b"refdata")}, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_1_key_element_refs(self): index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_2_key_element_no_refs(self): index = self.make_index( key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ()), ((b"name", b"fin2"), b"beta", ()), ((b"ref", b"erence"), b"refdata", ()), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data"), (index, (b"ref", b"erence"), b"refdata"), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", b"fin1"), b"data"), (index, (b"name", b"fin2"), b"beta"), }, set(index.iter_entries_prefix([(b"name", None)])), ) def test_iter_key_prefix_2_key_element_refs(self): index = self.make_index( 1, key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ([(b"ref", b"erence")],)), ((b"name", b"fin2"), b"beta", ([],)), ((b"ref", b"erence"), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"ref", b"erence"), b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"name", b"fin2"), b"beta", ((),)), }, set(index.iter_entries_prefix([(b"name", None)])), ) # XXX: external_references tests are duplicated in test_index. We # probably should have per_graph_index tests... 
def test_external_references_no_refs(self): index = self.make_index(ref_lists=0, nodes=[]) self.assertRaises(ValueError, index.external_references, 0) def test_external_references_no_results(self): index = self.make_index(ref_lists=1, nodes=[((b"key",), b"value", ([],))]) self.assertEqual(set(), index.external_references(0)) def test_external_references_missing_ref(self): missing_key = (b"missing",) index = self.make_index( ref_lists=1, nodes=[((b"key",), b"value", ([missing_key],))] ) self.assertEqual({missing_key}, index.external_references(0)) def test_external_references_multiple_ref_lists(self): missing_key = (b"missing",) index = self.make_index( ref_lists=2, nodes=[((b"key",), b"value", ([], [missing_key]))] ) self.assertEqual(set(), index.external_references(0)) self.assertEqual({missing_key}, index.external_references(1)) def test_external_references_two_records(self): index = self.make_index( ref_lists=1, nodes=[ ((b"key-1",), b"value", ([(b"key-2",)],)), ((b"key-2",), b"value", ([],)), ], ) self.assertEqual(set(), index.external_references(0)) def test__find_ancestors_one_page(self): key1 = (b"key-1",) key2 = (b"key-2",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([],)), ], ) parent_map = {} missing_keys = set() search_keys = index._find_ancestors([key1], 0, parent_map, missing_keys) self.assertEqual({key1: (key2,), key2: ()}, parent_map) self.assertEqual(set(), missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_one_page_w_missing(self): key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([],)), ], ) parent_map = {} missing_keys = set() search_keys = index._find_ancestors([key2, key3], 0, parent_map, missing_keys) self.assertEqual({key2: ()}, parent_map) # we know that key3 is missing because we read the page that it would # otherwise be on 
self.assertEqual({key3}, missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_one_parent_missing(self): key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([key3],)), ], ) parent_map = {} missing_keys = set() search_keys = index._find_ancestors([key1], 0, parent_map, missing_keys) self.assertEqual({key1: (key2,), key2: (key3,)}, parent_map) self.assertEqual(set(), missing_keys) # all we know is that key3 wasn't present on the page we were reading # but if you look, the last key is key2 which comes before key3, so we # don't know whether key3 would land on this page or not. self.assertEqual({key3}, search_keys) search_keys = index._find_ancestors(search_keys, 0, parent_map, missing_keys) # passing it back in, we are sure it is 'missing' self.assertEqual({key1: (key2,), key2: (key3,)}, parent_map) self.assertEqual({key3}, missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_dont_search_known(self): key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([key3],)), (key3, b"value", ([],)), ], ) # We already know about key2, so we won't try to search for key3 parent_map = {key2: (key3,)} missing_keys = set() search_keys = index._find_ancestors([key1], 0, parent_map, missing_keys) self.assertEqual({key1: (key2,), key2: (key3,)}, parent_map) self.assertEqual(set(), missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_multiple_pages(self): # We need to use enough keys that we actually cause a split start_time = 1249671539 email = "joebob@example.com" nodes = [] ref_lists = ((),) rev_keys = [] for i in range(400): rev_id = ( "{}-{}-{}".format( email, time.strftime("%Y%m%d%H%M%S", time.gmtime(start_time + i)), osutils.rand_chars(16), ) ).encode("ascii") rev_key = (rev_id,) 
nodes.append((rev_key, b"value", ref_lists)) # We have a ref 'list' of length 1, with a list of parents, with 1 # parent which is a key ref_lists = ((rev_key,),) rev_keys.append(rev_key) index = self.make_index(ref_lists=1, key_elements=1, nodes=nodes) self.assertEqual(400, index.key_count()) self.assertEqual(3, len(index._row_offsets)) nodes = dict(index._read_nodes([1, 2])) l1 = nodes[1] l2 = nodes[2] min_l2_key = l2.min_key max_l1_key = l1.max_key self.assertLess(max_l1_key, min_l2_key) parents_min_l2_key = l2[min_l2_key][1][0] self.assertEqual((l1.max_key,), parents_min_l2_key) # Now, whatever key we select that would fall on the second page, # should give us all the parents until the page break key_idx = rev_keys.index(min_l2_key) next_key = rev_keys[key_idx + 1] # So now when we get the parent map, we should get the key we are # looking for, min_l2_key, and then a reference to go look for the # parent of that key parent_map = {} missing_keys = set() search_keys = index._find_ancestors([next_key], 0, parent_map, missing_keys) self.assertEqual([min_l2_key, next_key], sorted(parent_map)) self.assertEqual(set(), missing_keys) self.assertEqual({max_l1_key}, search_keys) parent_map = {} search_keys = index._find_ancestors([max_l1_key], 0, parent_map, missing_keys) self.assertEqual(l1.all_keys(), sorted(parent_map)) self.assertEqual(set(), missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_empty_index(self): index = self.make_index(ref_lists=1, key_elements=1, nodes=[]) parent_map = {} missing_keys = set() search_keys = index._find_ancestors( [("one",), ("two",)], 0, parent_map, missing_keys ) self.assertEqual(set(), search_keys) self.assertEqual({}, parent_map) self.assertEqual({("one",), ("two",)}, missing_keys) def test_supports_unlimited_cache(self): builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1) # We need enough nodes to cause a page split (so we have both an # internal node and a couple leaf nodes. 
500 seems to be enough.) nodes = self.make_nodes(500, 1, 0) for node in nodes: builder.add_node(*node) stream = builder.finish() trans = self.get_transport() size = trans.put_file("index", stream) index = btree_index.BTreeGraphIndex(trans, "index", size) self.assertEqual(500, index.key_count()) # We have an internal node self.assertEqual(2, len(index._row_lengths)) # We have at least 2 leaf nodes self.assertGreaterEqual(index._row_lengths[-1], 2) self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache) self.assertEqual( btree_index._NODE_CACHE_SIZE, index._leaf_node_cache._max_cache ) self.assertIsInstance(index._internal_node_cache, FIFOCache) self.assertEqual(100, index._internal_node_cache._max_cache) # No change if unlimited_cache=False is passed index = btree_index.BTreeGraphIndex(trans, "index", size, unlimited_cache=False) self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache) self.assertEqual( btree_index._NODE_CACHE_SIZE, index._leaf_node_cache._max_cache ) self.assertIsInstance(index._internal_node_cache, FIFOCache) self.assertEqual(100, index._internal_node_cache._max_cache) index = btree_index.BTreeGraphIndex(trans, "index", size, unlimited_cache=True) self.assertIsInstance(index._leaf_node_cache, dict) self.assertIs(type(index._internal_node_cache), dict) # Exercise the lookup code entries = set(index.iter_entries([n[0] for n in nodes])) self.assertEqual(500, len(entries)) class TestBTreeNodes(BTreeTestCase): scenarios = btreeparser_scenarios() def setUp(self): super().setUp() self.overrideAttr(btree_index, "_btree_serializer", self.parse_btree) def test_LeafNode_1_0(self): node_bytes = ( b"type=leaf\n" b"0000000000000000000000000000000000000000\x00\x00value:0\n" b"1111111111111111111111111111111111111111\x00\x00value:1\n" b"2222222222222222222222222222222222222222\x00\x00value:2\n" b"3333333333333333333333333333333333333333\x00\x00value:3\n" b"4444444444444444444444444444444444444444\x00\x00value:4\n" ) node = 
btree_index._LeafNode(node_bytes, 1, 0) # We do direct access, or don't care about order, to leaf nodes most of # the time, so a dict is useful: self.assertEqual( { (b"0000000000000000000000000000000000000000",): (b"value:0", ()), (b"1111111111111111111111111111111111111111",): (b"value:1", ()), (b"2222222222222222222222222222222222222222",): (b"value:2", ()), (b"3333333333333333333333333333333333333333",): (b"value:3", ()), (b"4444444444444444444444444444444444444444",): (b"value:4", ()), }, dict(node.all_items()), ) def test_LeafNode_2_2(self): node_bytes = ( b"type=leaf\n" b"00\x0000\x00\t00\x00ref00\x00value:0\n" b"00\x0011\x0000\x00ref00\t00\x00ref00\r01\x00ref01\x00value:1\n" b"11\x0033\x0011\x00ref22\t11\x00ref22\r11\x00ref22\x00value:3\n" b"11\x0044\x00\t11\x00ref00\x00value:4\n" b"" ) node = btree_index._LeafNode(node_bytes, 2, 2) # We do direct access, or don't care about order, to leaf nodes most of # the time, so a dict is useful: self.assertEqual( { (b"00", b"00"): (b"value:0", ((), ((b"00", b"ref00"),))), (b"00", b"11"): ( b"value:1", (((b"00", b"ref00"),), ((b"00", b"ref00"), (b"01", b"ref01"))), ), (b"11", b"33"): ( b"value:3", (((b"11", b"ref22"),), ((b"11", b"ref22"), (b"11", b"ref22"))), ), (b"11", b"44"): (b"value:4", ((), ((b"11", b"ref00"),))), }, dict(node.all_items()), ) def test_InternalNode_1(self): node_bytes = ( b"type=internal\n" b"offset=1\n" b"0000000000000000000000000000000000000000\n" b"1111111111111111111111111111111111111111\n" b"2222222222222222222222222222222222222222\n" b"3333333333333333333333333333333333333333\n" b"4444444444444444444444444444444444444444\n" ) node = btree_index._InternalNode(node_bytes) # We want to bisect to find the right children from this node, so a # vector is most useful. 
self.assertEqual( [ (b"0000000000000000000000000000000000000000",), (b"1111111111111111111111111111111111111111",), (b"2222222222222222222222222222222222222222",), (b"3333333333333333333333333333333333333333",), (b"4444444444444444444444444444444444444444",), ], node.keys, ) self.assertEqual(1, node.offset) def assertFlattened(self, expected, key, value, refs): flat_key, flat_line = self.parse_btree._flatten_node( (None, key, value, refs), bool(refs) ) self.assertEqual(b"\x00".join(key), flat_key) self.assertEqual(expected, flat_line) def test__flatten_node(self): self.assertFlattened(b"key\0\0value\n", (b"key",), b"value", []) self.assertFlattened( b"key\0tuple\0\0value str\n", (b"key", b"tuple"), b"value str", [] ) self.assertFlattened( b"key\0tuple\0triple\0\0value str\n", (b"key", b"tuple", b"triple"), b"value str", [], ) self.assertFlattened( b"k\0t\0s\0ref\0value str\n", (b"k", b"t", b"s"), b"value str", [[(b"ref",)]], ) self.assertFlattened( b"key\0tuple\0ref\0key\0value str\n", (b"key", b"tuple"), b"value str", [[(b"ref", b"key")]], ) self.assertFlattened( b"00\x0000\x00\t00\x00ref00\x00value:0\n", (b"00", b"00"), b"value:0", ((), ((b"00", b"ref00"),)), ) self.assertFlattened( b"00\x0011\x0000\x00ref00\t00\x00ref00\r01\x00ref01\x00value:1\n", (b"00", b"11"), b"value:1", (((b"00", b"ref00"),), ((b"00", b"ref00"), (b"01", b"ref01"))), ) self.assertFlattened( b"11\x0033\x0011\x00ref22\t11\x00ref22\r11\x00ref22\x00value:3\n", (b"11", b"33"), b"value:3", (((b"11", b"ref22"),), ((b"11", b"ref22"), (b"11", b"ref22"))), ) self.assertFlattened( b"11\x0044\x00\t11\x00ref00\x00value:4\n", (b"11", b"44"), b"value:4", ((), ((b"11", b"ref00"),)), ) class TestCompiledBtree(TestCase): def test_exists(self): # This is just to let the user know if they don't have the feature # available if _compiled_btreeparser_module is None: self.skipTest("bzrformats._btree_serializer_pyx not available") class TestMultiBisectRight(TestCase): def assertMultiBisectRight(self, offsets, 
search_keys, fixed_keys): self.assertEqual( offsets, btree_index.BTreeGraphIndex._multi_bisect_right(search_keys, fixed_keys), ) def test_after(self): self.assertMultiBisectRight([(1, ["b"])], ["b"], ["a"]) self.assertMultiBisectRight( [(3, ["e", "f", "g"])], ["e", "f", "g"], ["a", "b", "c"] ) def test_before(self): self.assertMultiBisectRight([(0, ["a"])], ["a"], ["b"]) self.assertMultiBisectRight( [(0, ["a", "b", "c", "d"])], ["a", "b", "c", "d"], ["e", "f", "g"] ) def test_exact(self): self.assertMultiBisectRight([(1, ["a"])], ["a"], ["a"]) self.assertMultiBisectRight([(1, ["a"]), (2, ["b"])], ["a", "b"], ["a", "b"]) self.assertMultiBisectRight( [(1, ["a"]), (3, ["c"])], ["a", "c"], ["a", "b", "c"] ) def test_inbetween(self): self.assertMultiBisectRight([(1, ["b"])], ["b"], ["a", "c"]) self.assertMultiBisectRight( [(1, ["b", "c", "d"]), (2, ["f", "g"])], ["b", "c", "d", "f", "g"], ["a", "e", "h"], ) def test_mixed(self): self.assertMultiBisectRight( [(0, ["a", "b"]), (2, ["d", "e"]), (4, ["g", "h"])], ["a", "b", "d", "e", "g", "h"], ["c", "d", "f", "g"], ) class TestExpandOffsets(TestCase): def make_index(self, size, recommended_pages=None): """Make an index with a generic size. This doesn't actually create anything on disk, it just primes a BTreeGraphIndex with the recommended information. 
""" index = btree_index.BTreeGraphIndex(MemoryTransport(), "test-index", size=size) if recommended_pages is not None: index._recommended_pages = recommended_pages return index def set_cached_offsets(self, index, cached_offsets): """Monkeypatch to give a canned answer for _get_offsets_for...().""" def _get_offsets_to_cached_pages(): cached = set(cached_offsets) return cached index._get_offsets_to_cached_pages = _get_offsets_to_cached_pages def prepare_index( self, index, node_ref_lists, key_length, key_count, row_lengths, cached_offsets ): """Setup the BTreeGraphIndex with some pre-canned information.""" index.node_ref_lists = node_ref_lists index._key_length = key_length index._key_count = key_count index._row_lengths = row_lengths index._compute_row_offsets() index._root_node = btree_index._InternalNode(b"internal\noffset=0\n") self.set_cached_offsets(index, cached_offsets) def make_100_node_index(self): index = self.make_index(4096 * 100, 6) # Consider we've already made a single request at the middle self.prepare_index( index, node_ref_lists=0, key_length=1, key_count=1000, row_lengths=[1, 99], cached_offsets=[0, 50], ) return index def make_1000_node_index(self): index = self.make_index(4096 * 1000, 6) # Pretend we've already made a single request in the middle self.prepare_index( index, node_ref_lists=0, key_length=1, key_count=90000, row_lengths=[1, 9, 990], cached_offsets=[0, 5, 500], ) return index def assertNumPages(self, expected_pages, index, size): index._size = size self.assertEqual(expected_pages, index._compute_total_pages_in_index()) def assertExpandOffsets(self, expected, index, offsets): self.assertEqual( expected, index._expand_offsets(offsets), f"We did not get the expected value after expanding {offsets}", ) def test_default_recommended_pages(self): index = self.make_index(None) # local transport recommends 4096 byte reads, which is 1 page self.assertEqual(1, index._recommended_pages) def test__compute_total_pages_in_index(self): index = 
self.make_index(None) self.assertNumPages(1, index, 1024) self.assertNumPages(1, index, 4095) self.assertNumPages(1, index, 4096) self.assertNumPages(2, index, 4097) self.assertNumPages(2, index, 8192) self.assertNumPages(76, index, 4096 * 75 + 10) def test__find_layer_start_and_stop(self): index = self.make_1000_node_index() self.assertEqual((0, 1), index._find_layer_first_and_end(0)) self.assertEqual((1, 10), index._find_layer_first_and_end(1)) self.assertEqual((1, 10), index._find_layer_first_and_end(9)) self.assertEqual((10, 1000), index._find_layer_first_and_end(10)) self.assertEqual((10, 1000), index._find_layer_first_and_end(99)) self.assertEqual((10, 1000), index._find_layer_first_and_end(999)) def test_unknown_size(self): # We should not expand if we don't know the file size index = self.make_index(None, 10) self.assertExpandOffsets([0], index, [0]) self.assertExpandOffsets([1, 4, 9], index, [1, 4, 9]) def test_more_than_recommended(self): index = self.make_index(4096 * 100, 2) self.assertExpandOffsets([1, 10], index, [1, 10]) self.assertExpandOffsets([1, 10, 20], index, [1, 10, 20]) def test_read_all_from_root(self): index = self.make_index(4096 * 10, 20) self.assertExpandOffsets(list(range(10)), index, [0]) def test_read_all_when_cached(self): # We've read enough that we can grab all the rest in a single request index = self.make_index(4096 * 10, 5) self.prepare_index( index, node_ref_lists=0, key_length=1, key_count=1000, row_lengths=[1, 9], cached_offsets=[0, 1, 2, 5, 6], ) # It should fill the remaining nodes, regardless of the one requested self.assertExpandOffsets([3, 4, 7, 8, 9], index, [3]) self.assertExpandOffsets([3, 4, 7, 8, 9], index, [8]) self.assertExpandOffsets([3, 4, 7, 8, 9], index, [9]) def test_no_root_node(self): index = self.make_index(4096 * 10, 5) self.assertExpandOffsets([0], index, [0]) def test_include_neighbors(self): index = self.make_100_node_index() # We expand in both directions, until we have at least 'recommended' # pages 
self.assertExpandOffsets([9, 10, 11, 12, 13, 14, 15], index, [12]) self.assertExpandOffsets([88, 89, 90, 91, 92, 93, 94], index, [91]) # If we hit an 'edge' we continue in the other direction self.assertExpandOffsets([1, 2, 3, 4, 5, 6], index, [2]) self.assertExpandOffsets([94, 95, 96, 97, 98, 99], index, [98]) # Requesting many nodes will expand all locations equally self.assertExpandOffsets([1, 2, 3, 80, 81, 82], index, [2, 81]) self.assertExpandOffsets([1, 2, 3, 9, 10, 11, 80, 81, 82], index, [2, 10, 81]) def test_stop_at_cached(self): index = self.make_100_node_index() self.set_cached_offsets(index, [0, 10, 19]) self.assertExpandOffsets([11, 12, 13, 14, 15, 16], index, [11]) self.assertExpandOffsets([11, 12, 13, 14, 15, 16], index, [12]) self.assertExpandOffsets([12, 13, 14, 15, 16, 17, 18], index, [15]) self.assertExpandOffsets([13, 14, 15, 16, 17, 18], index, [16]) self.assertExpandOffsets([13, 14, 15, 16, 17, 18], index, [17]) self.assertExpandOffsets([13, 14, 15, 16, 17, 18], index, [18]) def test_cannot_fully_expand(self): index = self.make_100_node_index() self.set_cached_offsets(index, [0, 10, 12]) # We don't go into an endless loop if we are bound by cached nodes self.assertExpandOffsets([11], index, [11]) def test_overlap(self): index = self.make_100_node_index() self.assertExpandOffsets([10, 11, 12, 13, 14, 15], index, [12, 13]) self.assertExpandOffsets([10, 11, 12, 13, 14, 15], index, [11, 14]) def test_stay_within_layer(self): index = self.make_1000_node_index() # When expanding a request, we won't read nodes from the next layer self.assertExpandOffsets([1, 2, 3, 4], index, [2]) self.assertExpandOffsets([6, 7, 8, 9], index, [6]) self.assertExpandOffsets([6, 7, 8, 9], index, [9]) self.assertExpandOffsets([10, 11, 12, 13, 14, 15], index, [10]) self.assertExpandOffsets([10, 11, 12, 13, 14, 15, 16], index, [13]) self.set_cached_offsets(index, [0, 4, 12]) self.assertExpandOffsets([5, 6, 7, 8, 9], index, [7]) self.assertExpandOffsets([10, 11], index, 
[11]) def test_small_requests_unexpanded(self): index = self.make_100_node_index() self.set_cached_offsets(index, [0]) self.assertExpandOffsets([1], index, [1]) self.assertExpandOffsets([50], index, [50]) # If we request more than one node, then we'll expand self.assertExpandOffsets([49, 50, 51, 59, 60, 61], index, [50, 60]) # The first pass does not expand index = self.make_1000_node_index() self.set_cached_offsets(index, [0]) self.assertExpandOffsets([1], index, [1]) self.set_cached_offsets(index, [0, 1]) self.assertExpandOffsets([100], index, [100]) self.set_cached_offsets(index, [0, 1, 100]) # But after the first depth, we will expand self.assertExpandOffsets([2, 3, 4, 5, 6, 7], index, [2]) self.assertExpandOffsets([2, 3, 4, 5, 6, 7], index, [4]) self.set_cached_offsets(index, [0, 1, 2, 3, 4, 5, 6, 7, 100]) self.assertExpandOffsets([102, 103, 104, 105, 106, 107, 108], index, [105]) bzrformats_3.4.0.orig/bzrformats/tests/test_chk_map.py0000644000000000000000000043303315162115107020231 0ustar00# Copyright (C) 2008-2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for maps built on a CHK versionedfiles facility.""" from bzrformats import osutils from bzrformats.errors import InconsistentDeltaDelta from .. 
import chk_map, groupcompress from ..chk_map import ( CHKMap, InternalNode, LeafNode, _bytes_to_text_key, _deserialise_internal_node, _deserialise_leaf_node, _search_key_16, _search_key_255, common_prefix_pair, ) from . import TestCase, TestCaseWithMemoryTransport class TestDeserialiseLeafNode(TestCase): """Tests for Deserialise Leaf Node.""" def assertDeserialiseErrors(self, text): """Assert DeserialiseErrors.""" self.assertRaises( (ValueError, IndexError), _deserialise_leaf_node, text, b"not-a-real-sha", ) def test_raises_on_non_leaf(self): """Test raises on non leaf.""" self.assertDeserialiseErrors(b"") self.assertDeserialiseErrors(b"short\n") self.assertDeserialiseErrors(b"chknotleaf:\n") self.assertDeserialiseErrors(b"chkleaf:x\n") self.assertDeserialiseErrors(b"chkleaf:\n") self.assertDeserialiseErrors(b"chkleaf:\nnotint\n") self.assertDeserialiseErrors(b"chkleaf:\n10\n") self.assertDeserialiseErrors(b"chkleaf:\n10\n256\n") self.assertDeserialiseErrors(b"chkleaf:\n10\n256\n10\n") def test_deserialise_empty(self): """Test deserialise empty.""" node = _deserialise_leaf_node( b"chkleaf:\n10\n1\n0\n\n", (b"sha1:1234",), ) self.assertEqual(0, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertIsInstance(node.key(), tuple) self.assertIs(None, node._search_prefix) self.assertIs(None, node._common_serialised_prefix) def test_deserialise_items(self): """Test deserialise items.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo bar",), b"baz"), ((b"quux",), b"blarh")], sorted(node.iteritems(None)), ) def test_deserialise_item_with_null_width_1(self): """Test deserialise item with null width 1.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n1\n2\n\nfoo\x001\nbar\x00baz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo",), b"bar\x00baz"), 
((b"quux",), b"blarh")], sorted(node.iteritems(None)), ) def test_deserialise_item_with_null_width_2(self): """Test deserialise item with null width 2.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n2\n2\n\nfoo\x001\x001\nbar\x00baz\nquux\x00\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo", b"1"), b"bar\x00baz"), ((b"quux", b""), b"blarh")], sorted(node.iteritems(None)), ) def test_iteritems_selected_one_of_two_items(self): """Test iteritems selected one of two items.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"quux",), b"blarh")], sorted(node.iteritems(None, [(b"quux",), (b"qaz",)])), ) def test_deserialise_item_with_common_prefix(self): """Test deserialise item with common prefix.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n2\n2\nfoo\x00\n1\x001\nbar\x00baz\n2\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo", b"1"), b"bar\x00baz"), ((b"foo", b"2"), b"blarh")], sorted(node.iteritems(None)), ) self.assertIs(chk_map._unknown, node._search_prefix) self.assertEqual(b"foo\x00", node._common_serialised_prefix) def test_deserialise_multi_line(self): """Test deserialise multi line.""" node = _deserialise_leaf_node( b"chkleaf:\n0\n2\n2\nfoo\x00\n1\x002\nbar\nbaz\n2\x002\nblarh\n\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [ ((b"foo", b"1"), b"bar\nbaz"), ((b"foo", b"2"), b"blarh\n"), ], sorted(node.iteritems(None)), ) self.assertIs(chk_map._unknown, node._search_prefix) self.assertEqual(b"foo\x00", node._common_serialised_prefix) def test_key_after_map(self): """Test key after map.""" node = _deserialise_leaf_node(b"chkleaf:\n10\n1\n0\n\n", (b"sha1:1234",)) node.map(None, (b"foo bar",), b"baz quux") self.assertEqual(None, node.key()) def test_key_after_unmap(self): """Test key after unmap.""" node = _deserialise_leaf_node( 
b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) node.unmap(None, (b"foo bar",)) self.assertEqual(None, node.key()) class TestDeserialiseInternalNode(TestCase): """Tests for Deserialise Internal Node.""" def assertDeserialiseErrors(self, text): """Assert DeserialiseErrors.""" self.assertRaises( (ValueError, IndexError), _deserialise_internal_node, text, (b"not-a-real-sha",), ) def test_raises_on_non_internal(self): """Test raises on non internal.""" self.assertDeserialiseErrors(b"") self.assertDeserialiseErrors(b"short\n") self.assertDeserialiseErrors(b"chknotnode:\n") self.assertDeserialiseErrors(b"chknode:x\n") self.assertDeserialiseErrors(b"chknode:\n") self.assertDeserialiseErrors(b"chknode:\nnotint\n") self.assertDeserialiseErrors(b"chknode:\n10\n") self.assertDeserialiseErrors(b"chknode:\n10\n256\n") self.assertDeserialiseErrors(b"chknode:\n10\n256\n10\n") # no trailing newline self.assertDeserialiseErrors(b"chknode:\n10\n256\n0\n1\nfo") def test_deserialise_one(self): """Test deserialise one.""" node = _deserialise_internal_node( b"chknode:\n10\n1\n1\n\na\x00sha1:abcd\n", (b"sha1:1234",), ) self.assertIsInstance(node, chk_map.InternalNode) self.assertEqual(1, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertEqual(b"", node._search_prefix) self.assertEqual({b"a": (b"sha1:abcd",)}, node._items) def test_deserialise_with_prefix(self): """Test deserialise with prefix.""" node = _deserialise_internal_node( b"chknode:\n10\n1\n1\npref\na\x00sha1:abcd\n", (b"sha1:1234",), ) self.assertIsInstance(node, chk_map.InternalNode) self.assertEqual(1, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertEqual(b"pref", node._search_prefix) self.assertEqual({b"prefa": (b"sha1:abcd",)}, node._items) node = _deserialise_internal_node( b"chknode:\n10\n1\n1\npref\n\x00sha1:abcd\n", (b"sha1:1234",), ) self.assertIsInstance(node, 
chk_map.InternalNode) self.assertEqual(1, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertEqual(b"pref", node._search_prefix) self.assertEqual({b"pref": (b"sha1:abcd",)}, node._items) def test_deserialise_pref_with_null(self): """Test deserialise pref with null.""" node = _deserialise_internal_node( b"chknode:\n10\n1\n1\npref\x00fo\n\x00sha1:abcd\n", (b"sha1:1234",), ) self.assertIsInstance(node, chk_map.InternalNode) self.assertEqual(1, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertEqual(b"pref\x00fo", node._search_prefix) self.assertEqual({b"pref\x00fo": (b"sha1:abcd",)}, node._items) def test_deserialise_with_null_pref(self): """Test deserialise with null pref.""" node = _deserialise_internal_node( b"chknode:\n10\n1\n1\npref\x00fo\n\x00\x00sha1:abcd\n", (b"sha1:1234",), ) self.assertIsInstance(node, chk_map.InternalNode) self.assertEqual(1, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertEqual(b"pref\x00fo", node._search_prefix) self.assertEqual({b"pref\x00fo\x00": (b"sha1:abcd",)}, node._items) class TestNode(TestCase): """Tests for Node.""" def assertCommonPrefix(self, expected_common, prefix, key): """Assert CommonPrefix.""" common = common_prefix_pair(prefix, key) self.assertLessEqual(len(common), len(prefix)) self.assertLessEqual(len(common), len(key)) self.assertStartsWith(prefix, common) self.assertStartsWith(key, common) self.assertEqual(expected_common, common) def test_common_prefix(self): """Test common prefix.""" self.assertCommonPrefix(b"beg", b"beg", b"begin") def test_no_common_prefix(self): """Test no common prefix.""" self.assertCommonPrefix(b"", b"begin", b"end") def test_equal(self): """Test equal.""" self.assertCommonPrefix(b"begin", b"begin", b"begin") def test_not_a_prefix(self): """Test not a prefix.""" self.assertCommonPrefix(b"b", b"begin", b"b") def 
test_empty(self): """Test empty.""" self.assertCommonPrefix(b"", b"", b"end") self.assertCommonPrefix(b"", b"begin", b"") self.assertCommonPrefix(b"", b"", b"") class TestCaseWithStore(TestCaseWithMemoryTransport): """Tests for Store.""" def get_chk_bytes(self): """Get chk bytes.""" # This creates a standalone CHK store. factory = groupcompress.make_pack_factory(False, False, 1) self.chk_bytes = factory(self.get_transport()) return self.chk_bytes def _get_map( self, a_dict, maximum_size=0, chk_bytes=None, key_width=1, search_key_func=None ): if chk_bytes is None: chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict( chk_bytes, a_dict, maximum_size=maximum_size, key_width=key_width, search_key_func=search_key_func, ) root_key2 = CHKMap._create_via_map( chk_bytes, a_dict, maximum_size=maximum_size, key_width=key_width, search_key_func=search_key_func, ) self.assertEqual( root_key, root_key2, "CHKMap.from_dict() did not match CHKMap._create_via_map", ) chkmap = CHKMap(chk_bytes, root_key, search_key_func=search_key_func) return chkmap def read_bytes(self, chk_bytes, key): """Read bytes.""" stream = chk_bytes.get_record_stream([key], "unordered", True) record = next(stream) if record.storage_kind == "absent": self.fail(f"Store does not contain the key {key}") return record.get_bytes_as("fulltext") def to_dict(self, node, *args): """To dict.""" return dict(node.iteritems(*args)) class TestCaseWithExampleMaps(TestCaseWithStore): """Tests for Example Maps.""" def get_chk_bytes(self): """Get chk bytes.""" if getattr(self, "_chk_bytes", None) is None: self._chk_bytes = super().get_chk_bytes() return self._chk_bytes def get_map(self, a_dict, maximum_size=100, search_key_func=None): """Get map.""" c_map = self._get_map( a_dict, maximum_size=maximum_size, chk_bytes=self.get_chk_bytes(), search_key_func=search_key_func, ) return c_map def make_root_only_map(self, search_key_func=None): """Make root only map.""" return self.get_map( { (b"aaa",): b"initial aaa content", 
(b"abb",): b"initial abb content", }, search_key_func=search_key_func, ) def make_root_only_aaa_ddd_map(self, search_key_func=None): """Make root only aaa ddd map.""" return self.get_map( { (b"aaa",): b"initial aaa content", (b"ddd",): b"initial ddd content", }, search_key_func=search_key_func, ) def make_one_deep_map(self, search_key_func=None): """Make one deep map.""" # Same as root_only_map, except it forces an InternalNode at the root return self.get_map( { (b"aaa",): b"initial aaa content", (b"abb",): b"initial abb content", (b"ccc",): b"initial ccc content", (b"ddd",): b"initial ddd content", }, search_key_func=search_key_func, ) def make_two_deep_map(self, search_key_func=None): """Make two deep map.""" # Carefully chosen so that it creates a 2-deep map for both # _search_key_plain and for _search_key_16 # Also so that things line up with make_one_deep_two_prefix_map return self.get_map( { (b"aaa",): b"initial aaa content", (b"abb",): b"initial abb content", (b"acc",): b"initial acc content", (b"ace",): b"initial ace content", (b"add",): b"initial add content", (b"adh",): b"initial adh content", (b"adl",): b"initial adl content", (b"ccc",): b"initial ccc content", (b"ddd",): b"initial ddd content", }, search_key_func=search_key_func, ) def make_one_deep_two_prefix_map(self, search_key_func=None): """Create a map with one internal node, but references are extra long. Otherwise has similar content to make_two_deep_map. """ return self.get_map( { (b"aaa",): b"initial aaa content", (b"add",): b"initial add content", (b"adh",): b"initial adh content", (b"adl",): b"initial adl content", }, search_key_func=search_key_func, ) def make_one_deep_one_prefix_map(self, search_key_func=None): """Create a map with one internal node, but references are extra long. Similar to make_one_deep_two_prefix_map, except the split is at the first char, rather than the second. 
""" return self.get_map( { (b"add",): b"initial add content", (b"adh",): b"initial adh content", (b"adl",): b"initial adl content", (b"bbb",): b"initial bbb content", }, search_key_func=search_key_func, ) class TestTestCaseWithExampleMaps(TestCaseWithExampleMaps): """Actual tests for the provided examples.""" def test_root_only_map_plain(self): """Test root only map plain.""" c_map = self.make_root_only_map() self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('abb',) 'initial abb content'\n", c_map._dump_tree(), ) def test_root_only_map_16(self): """Test root only map 16.""" c_map = self.make_root_only_map(search_key_func=chk_map._search_key_16) self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('abb',) 'initial abb content'\n", c_map._dump_tree(), ) def test_one_deep_map_plain(self): """Test one deep map plain.""" c_map = self.make_one_deep_map() self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('abb',) 'initial abb content'\n" " 'c' LeafNode\n" " ('ccc',) 'initial ccc content'\n" " 'd' LeafNode\n" " ('ddd',) 'initial ddd content'\n", c_map._dump_tree(), ) def test_one_deep_map_16(self): """Test one deep map 16.""" c_map = self.make_one_deep_map(search_key_func=chk_map._search_key_16) self.assertEqualDiff( "'' InternalNode\n" " '2' LeafNode\n" " ('ccc',) 'initial ccc content'\n" " '4' LeafNode\n" " ('abb',) 'initial abb content'\n" " 'F' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('ddd',) 'initial ddd content'\n", c_map._dump_tree(), ) def test_root_only_aaa_ddd_plain(self): """Test root only aaa ddd plain.""" c_map = self.make_root_only_aaa_ddd_map() self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('ddd',) 'initial ddd content'\n", c_map._dump_tree(), ) def test_root_only_aaa_ddd_16(self): """Test root only aaa ddd 16.""" c_map = self.make_root_only_aaa_ddd_map(search_key_func=chk_map._search_key_16) # We use 'aaa' and 'ddd' 
because they happen to map to 'F' when using # _search_key_16 self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " ('ddd',) 'initial ddd content'\n", c_map._dump_tree(), ) def test_two_deep_map_plain(self): """Test two deep map plain.""" c_map = self.make_two_deep_map() self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aa' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " 'ab' LeafNode\n" " ('abb',) 'initial abb content'\n" " 'ac' LeafNode\n" " ('acc',) 'initial acc content'\n" " ('ace',) 'initial ace content'\n" " 'ad' LeafNode\n" " ('add',) 'initial add content'\n" " ('adh',) 'initial adh content'\n" " ('adl',) 'initial adl content'\n" " 'c' LeafNode\n" " ('ccc',) 'initial ccc content'\n" " 'd' LeafNode\n" " ('ddd',) 'initial ddd content'\n", c_map._dump_tree(), ) def test_two_deep_map_16(self): """Test two deep map 16.""" c_map = self.make_two_deep_map(search_key_func=chk_map._search_key_16) self.assertEqualDiff( "'' InternalNode\n" " '2' LeafNode\n" " ('acc',) 'initial acc content'\n" " ('ccc',) 'initial ccc content'\n" " '4' LeafNode\n" " ('abb',) 'initial abb content'\n" " 'C' LeafNode\n" " ('ace',) 'initial ace content'\n" " 'F' InternalNode\n" " 'F0' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " 'F3' LeafNode\n" " ('adl',) 'initial adl content'\n" " 'F4' LeafNode\n" " ('adh',) 'initial adh content'\n" " 'FB' LeafNode\n" " ('ddd',) 'initial ddd content'\n" " 'FD' LeafNode\n" " ('add',) 'initial add content'\n", c_map._dump_tree(), ) def test_one_deep_two_prefix_map_plain(self): """Test one deep two prefix map plain.""" c_map = self.make_one_deep_two_prefix_map() self.assertEqualDiff( "'' InternalNode\n" " 'aa' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " 'ad' LeafNode\n" " ('add',) 'initial add content'\n" " ('adh',) 'initial adh content'\n" " ('adl',) 'initial adl content'\n", c_map._dump_tree(), ) def test_one_deep_two_prefix_map_16(self): """Test one deep two prefix map 16.""" c_map = 
self.make_one_deep_two_prefix_map( search_key_func=chk_map._search_key_16 ) self.assertEqualDiff( "'' InternalNode\n" " 'F0' LeafNode\n" " ('aaa',) 'initial aaa content'\n" " 'F3' LeafNode\n" " ('adl',) 'initial adl content'\n" " 'F4' LeafNode\n" " ('adh',) 'initial adh content'\n" " 'FD' LeafNode\n" " ('add',) 'initial add content'\n", c_map._dump_tree(), ) def test_one_deep_one_prefix_map_plain(self): """Test one deep one prefix map plain.""" c_map = self.make_one_deep_one_prefix_map() self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('add',) 'initial add content'\n" " ('adh',) 'initial adh content'\n" " ('adl',) 'initial adl content'\n" " 'b' LeafNode\n" " ('bbb',) 'initial bbb content'\n", c_map._dump_tree(), ) def test_one_deep_one_prefix_map_16(self): """Test one deep one prefix map 16.""" c_map = self.make_one_deep_one_prefix_map( search_key_func=chk_map._search_key_16 ) self.assertEqualDiff( "'' InternalNode\n" " '4' LeafNode\n" " ('bbb',) 'initial bbb content'\n" " 'F' LeafNode\n" " ('add',) 'initial add content'\n" " ('adh',) 'initial adh content'\n" " ('adl',) 'initial adl content'\n", c_map._dump_tree(), ) class TestMap(TestCaseWithStore): """Tests for Map.""" def assertHasABMap(self, chk_bytes): """Assert HasABMap.""" ab_leaf_bytes = b"chkleaf:\n0\n1\n1\na\n\x001\nb\n" ab_sha1 = osutils.sha_string(ab_leaf_bytes) self.assertEqual(b"90986195696b177c8895d48fdb4b7f2366f798a0", ab_sha1) root_key = (b"sha1:" + ab_sha1,) self.assertEqual(ab_leaf_bytes, self.read_bytes(chk_bytes, root_key)) return root_key def assertHasEmptyMap(self, chk_bytes): """Assert HasEmptyMap.""" empty_leaf_bytes = b"chkleaf:\n0\n1\n0\n\n" empty_sha1 = osutils.sha_string(empty_leaf_bytes) self.assertEqual(b"8571e09bf1bcc5b9621ce31b3d4c93d6e9a1ed26", empty_sha1) root_key = (b"sha1:" + empty_sha1,) self.assertEqual(empty_leaf_bytes, self.read_bytes(chk_bytes, root_key)) return root_key def assertMapLayoutEqual(self, map_one, map_two): """Assert that the internal structure is 
identical between the maps.""" map_one._ensure_root() node_one_stack = [map_one._root_node] map_two._ensure_root() node_two_stack = [map_two._root_node] while node_one_stack: node_one = node_one_stack.pop() node_two = node_two_stack.pop() if node_one.__class__ != node_two.__class__: self.assertEqualDiff( map_one._dump_tree(include_keys=True), map_two._dump_tree(include_keys=True), ) self.assertEqual(node_one._search_prefix, node_two._search_prefix) if isinstance(node_one, InternalNode): # Internal nodes must have identical references self.assertEqual( sorted(node_one._items.keys()), sorted(node_two._items.keys()) ) node_one_stack.extend( sorted( [n for n, _ in node_one._iter_nodes(map_one._store)], key=lambda a: a._search_prefix, ) ) node_two_stack.extend( sorted( [n for n, _ in node_two._iter_nodes(map_two._store)], key=lambda a: a._search_prefix, ) ) else: # Leaf nodes must have identical contents self.assertEqual(node_one._items, node_two._items) self.assertEqual([], node_two_stack) def assertCanonicalForm(self, chkmap): """Assert that the chkmap is in 'canonical' form. We do this by adding all of the key value pairs from scratch, both in forward order and reverse order, and assert that the final tree layout is identical. 
""" items = list(chkmap.iteritems()) map_forward = chk_map.CHKMap(None, None) map_forward._root_node.set_maximum_size(chkmap._root_node.maximum_size) for key, value in items: map_forward.map(key, value) self.assertMapLayoutEqual(map_forward, chkmap) map_reverse = chk_map.CHKMap(None, None) map_reverse._root_node.set_maximum_size(chkmap._root_node.maximum_size) for key, value in reversed(items): map_reverse.map(key, value) self.assertMapLayoutEqual(map_reverse, chkmap) def test_assert_map_layout_equal(self): """Test assert map layout equal.""" store = self.get_chk_bytes() map_one = CHKMap(store, None) map_one._root_node.set_maximum_size(20) map_two = CHKMap(store, None) map_two._root_node.set_maximum_size(20) self.assertMapLayoutEqual(map_one, map_two) map_one.map((b"aaa",), b"value") self.assertRaises(AssertionError, self.assertMapLayoutEqual, map_one, map_two) map_two.map((b"aaa",), b"value") self.assertMapLayoutEqual(map_one, map_two) # Split the tree, so we ensure that internal nodes and leaf nodes are # properly checked map_one.map((b"aab",), b"value") self.assertIsInstance(map_one._root_node, InternalNode) self.assertRaises(AssertionError, self.assertMapLayoutEqual, map_one, map_two) map_two.map((b"aab",), b"value") self.assertMapLayoutEqual(map_one, map_two) map_one.map((b"aac",), b"value") self.assertRaises(AssertionError, self.assertMapLayoutEqual, map_one, map_two) self.assertCanonicalForm(map_one) def test_from_dict_empty(self): """Test from dict empty.""" chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict(chk_bytes, {}) # Check the data was saved and inserted correctly. expected_root_key = self.assertHasEmptyMap(chk_bytes) self.assertEqual(expected_root_key, root_key) def test_from_dict_ab(self): """Test from dict ab.""" chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict(chk_bytes, {(b"a",): b"b"}) # Check the data was saved and inserted correctly. 
        expected_root_key = self.assertHasABMap(chk_bytes)
        self.assertEqual(expected_root_key, root_key)

    def test_apply_empty_ab(self):
        """Test apply empty ab."""
        # applying a delta (None, "a", "b") to an empty chkmap generates the
        # same map as from_dict_ab.
        chk_bytes = self.get_chk_bytes()
        root_key = CHKMap.from_dict(chk_bytes, {})
        chkmap = CHKMap(chk_bytes, root_key)
        new_root = chkmap.apply_delta([(None, (b"a",), b"b")])
        # Check the data was saved and inserted correctly.
        expected_root_key = self.assertHasABMap(chk_bytes)
        self.assertEqual(expected_root_key, new_root)
        # The update should have left us with an in memory root node, with an
        # updated key.
        self.assertEqual(new_root, chkmap._root_node._key)

    def test_apply_ab_empty(self):
        """Test apply ab empty."""
        # applying a delta ("a", None, None) to a map with 'a' in it generates
        # an empty map.
        chk_bytes = self.get_chk_bytes()
        root_key = CHKMap.from_dict(chk_bytes, {(b"a",): b"b"})
        chkmap = CHKMap(chk_bytes, root_key)
        new_root = chkmap.apply_delta([((b"a",), None, None)])
        # Check the data was saved and inserted correctly.
        expected_root_key = self.assertHasEmptyMap(chk_bytes)
        self.assertEqual(expected_root_key, new_root)
        # The update should have left us with an in memory root node, with an
        # updated key.
        self.assertEqual(new_root, chkmap._root_node._key)

    def test_apply_delete_to_internal_node(self):
        """Test apply delete to internal node."""
        # applying a delta should convert an internal root node to a leaf
        # node if the delta shrinks the map enough.
        store = self.get_chk_bytes()
        chkmap = CHKMap(store, None)
        # Add three items: 2 small enough to fit in one node, and one huge to
        # force multiple nodes.
        chkmap._root_node.set_maximum_size(100)
        chkmap.map((b"small",), b"value")
        chkmap.map((b"little",), b"value")
        chkmap.map((b"very-big",), b"x" * 100)
        # (Check that we have constructed the scenario we want to test)
        self.assertIsInstance(chkmap._root_node, InternalNode)
        # Delete the huge item so that the map fits in one node again.
delta = [((b"very-big",), None, None)] chkmap.apply_delta(delta) self.assertCanonicalForm(chkmap) self.assertIsInstance(chkmap._root_node, LeafNode) def test_apply_new_keys_must_be_new(self): """Test apply new keys must be new.""" # applying a delta (None, "a", "b") to a map with 'a' in it generates # an error. chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict(chk_bytes, {(b"a",): b"b"}) chkmap = CHKMap(chk_bytes, root_key) self.assertRaises( InconsistentDeltaDelta, chkmap.apply_delta, [(None, (b"a",), b"b")] ) # As an error occurred, the update should have left us without changing # anything (the root should be unchanged). self.assertEqual(root_key, chkmap._root_node._key) def test_apply_delta_is_deterministic(self): """Test apply delta is deterministic.""" chk_bytes = self.get_chk_bytes() chkmap1 = CHKMap(chk_bytes, None) chkmap1._root_node.set_maximum_size(10) chkmap1.apply_delta( [ (None, (b"aaa",), b"common"), (None, (b"bba",), b"target2"), (None, (b"bbb",), b"common"), ] ) root_key1 = chkmap1._save() self.assertCanonicalForm(chkmap1) chkmap2 = CHKMap(chk_bytes, None) chkmap2._root_node.set_maximum_size(10) chkmap2.apply_delta( [ (None, (b"bbb",), b"common"), (None, (b"bba",), b"target2"), (None, (b"aaa",), b"common"), ] ) root_key2 = chkmap2._save() self.assertEqualDiff( chkmap1._dump_tree(include_keys=True), chkmap2._dump_tree(include_keys=True) ) self.assertEqual(root_key1, root_key2) self.assertCanonicalForm(chkmap2) def test_stable_splitting(self): """Test stable splitting.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(35) chkmap.map((b"aaa",), b"v") self.assertEqualDiff("'' LeafNode\n ('aaa',) 'v'\n", chkmap._dump_tree()) chkmap.map((b"aab",), b"v") self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'v'\n ('aab',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) # Creates a new internal node, and splits the others into leaves chkmap.map((b"aac",), b"v")
self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'v'\n" " 'aac' LeafNode\n" " ('aac',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) # Splits again, because it can't fit in the current structure chkmap.map((b"bbb",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'v'\n" " 'aac' LeafNode\n" " ('aac',) 'v'\n" " 'b' LeafNode\n" " ('bbb',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) def test_map_splits_with_longer_key(self): """Test map splits with longer key.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 1 key per LeafNode chkmap._root_node.set_maximum_size(10) chkmap.map((b"aaa",), b"v") chkmap.map((b"aaaa",), b"v") self.assertCanonicalForm(chkmap) self.assertIsInstance(chkmap._root_node, InternalNode) def test_with_linefeed_in_key(self): """Test with linefeed in key.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 1 key per LeafNode chkmap._root_node.set_maximum_size(10) chkmap.map((b"a\ra",), b"val1") chkmap.map((b"a\rb",), b"val2") chkmap.map((b"ac",), b"val3") self.assertCanonicalForm(chkmap) self.assertEqualDiff( "'' InternalNode\n" " 'a\\r' InternalNode\n" " 'a\\ra' LeafNode\n" " ('a\\ra',) 'val1'\n" " 'a\\rb' LeafNode\n" " ('a\\rb',) 'val2'\n" " 'ac' LeafNode\n" " ('ac',) 'val3'\n", chkmap._dump_tree(), ) # We should also successfully serialise and deserialise these items root_key = chkmap._save() chkmap = CHKMap(store, root_key) self.assertEqualDiff( "'' InternalNode\n" " 'a\\r' InternalNode\n" " 'a\\ra' LeafNode\n" " ('a\\ra',) 'val1'\n" " 'a\\rb' LeafNode\n" " ('a\\rb',) 'val2'\n" " 'ac' LeafNode\n" " ('ac',) 'val3'\n", chkmap._dump_tree(), ) def test_deep_splitting(self): """Test deep splitting.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode 
chkmap._root_node.set_maximum_size(40) chkmap.map((b"aaaaaaaa",), b"v") chkmap.map((b"aaaaabaa",), b"v") self.assertEqualDiff( "'' LeafNode\n ('aaaaaaaa',) 'v'\n ('aaaaabaa',) 'v'\n", chkmap._dump_tree(), ) chkmap.map((b"aaabaaaa",), b"v") chkmap.map((b"aaababaa",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aaaa' LeafNode\n" " ('aaaaaaaa',) 'v'\n" " ('aaaaabaa',) 'v'\n" " 'aaab' LeafNode\n" " ('aaabaaaa',) 'v'\n" " ('aaababaa',) 'v'\n", chkmap._dump_tree(), ) chkmap.map((b"aaabacaa",), b"v") chkmap.map((b"aaabadaa",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aaaa' LeafNode\n" " ('aaaaaaaa',) 'v'\n" " ('aaaaabaa',) 'v'\n" " 'aaab' InternalNode\n" " 'aaabaa' LeafNode\n" " ('aaabaaaa',) 'v'\n" " 'aaabab' LeafNode\n" " ('aaababaa',) 'v'\n" " 'aaabac' LeafNode\n" " ('aaabacaa',) 'v'\n" " 'aaabad' LeafNode\n" " ('aaabadaa',) 'v'\n", chkmap._dump_tree(), ) chkmap.map((b"aaababba",), b"val") chkmap.map((b"aaababca",), b"val") self.assertEqualDiff( "'' InternalNode\n" " 'aaaa' LeafNode\n" " ('aaaaaaaa',) 'v'\n" " ('aaaaabaa',) 'v'\n" " 'aaab' InternalNode\n" " 'aaabaa' LeafNode\n" " ('aaabaaaa',) 'v'\n" " 'aaabab' InternalNode\n" " 'aaababa' LeafNode\n" " ('aaababaa',) 'v'\n" " 'aaababb' LeafNode\n" " ('aaababba',) 'val'\n" " 'aaababc' LeafNode\n" " ('aaababca',) 'val'\n" " 'aaabac' LeafNode\n" " ('aaabacaa',) 'v'\n" " 'aaabad' LeafNode\n" " ('aaabadaa',) 'v'\n", chkmap._dump_tree(), ) # Now we add a node that should fit around an existing InternalNode, # but has a slightly different key prefix, which causes a new # InternalNode split chkmap.map((b"aaabDaaa",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aaaa' LeafNode\n" " ('aaaaaaaa',) 'v'\n" " ('aaaaabaa',) 'v'\n" " 'aaab' InternalNode\n" " 'aaabD' LeafNode\n" " ('aaabDaaa',) 'v'\n" " 'aaaba' InternalNode\n" " 'aaabaa' LeafNode\n" " ('aaabaaaa',) 'v'\n" " 'aaabab' InternalNode\n" " 'aaababa' LeafNode\n" " ('aaababaa',) 'v'\n" " 'aaababb' LeafNode\n" " ('aaababba',) 'val'\n" " 'aaababc' 
LeafNode\n" " ('aaababca',) 'val'\n" " 'aaabac' LeafNode\n" " ('aaabacaa',) 'v'\n" " 'aaabad' LeafNode\n" " ('aaabadaa',) 'v'\n", chkmap._dump_tree(), ) def test_map_collapses_if_size_changes(self): """Test map collapses if size changes.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(35) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"very long value that splits") self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'very long value that splits'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) # Now changing the value to something small should cause a rebuild chkmap.map((b"aab",), b"v") self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'v'\n ('aab',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) def test_map_double_deep_collapses(self): """Test map double deep collapses.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 3 small keys per LeafNode chkmap._root_node.set_maximum_size(40) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"very long value that splits") chkmap.map((b"abc",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aa' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'very long value that splits'\n" " 'ab' LeafNode\n" " ('abc',) 'v'\n", chkmap._dump_tree(), ) chkmap.map((b"aab",), b"v") self.assertCanonicalForm(chkmap) self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'v'\n ('aab',) 'v'\n ('abc',) 'v'\n", chkmap._dump_tree(), ) def test_stable_unmap(self): """Test stable unmap.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(35) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"v") self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'v'\n ('aab',) 'v'\n", chkmap._dump_tree(), ) # Creates a new internal node, and splits the others into leaves 
chkmap.map((b"aac",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'v'\n" " 'aac' LeafNode\n" " ('aac',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) # Now let's unmap one of the keys, and assert that we collapse the # structures. chkmap.unmap((b"aac",)) self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'v'\n ('aab',) 'v'\n", chkmap._dump_tree(), ) self.assertCanonicalForm(chkmap) def test_unmap_double_deep(self): """Test unmap double deep.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 3 keys per LeafNode chkmap._root_node.set_maximum_size(40) chkmap.map((b"aaa",), b"v") chkmap.map((b"aaab",), b"v") chkmap.map((b"aab",), b"very long value") chkmap.map((b"abc",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aa' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " ('aaab',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'very long value'\n" " 'ab' LeafNode\n" " ('abc',) 'v'\n", chkmap._dump_tree(), ) # Removing the 'aab' key should cause everything to collapse back to a # single node chkmap.unmap((b"aab",)) self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'v'\n" " ('aaab',) 'v'\n" " ('abc',) 'v'\n", chkmap._dump_tree(), ) def test_unmap_double_deep_non_empty_leaf(self): """Test unmap double deep non empty leaf.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 3 keys per LeafNode chkmap._root_node.set_maximum_size(40) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"long value") chkmap.map((b"aabb",), b"v") chkmap.map((b"abc",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aa' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'long value'\n" " ('aabb',) 'v'\n" " 'ab' LeafNode\n" " ('abc',) 'v'\n", chkmap._dump_tree(), ) # Removing the 'aab' key should cause everything to collapse back to a # single node chkmap.unmap((b"aab",)) self.assertEqualDiff( "'' LeafNode\n" " ('aaa',) 'v'\n" " 
('aabb',) 'v'\n" " ('abc',) 'v'\n", chkmap._dump_tree(), ) def test_unmap_with_known_internal_node_doesnt_page(self): """Test unmap with known internal node doesnt page.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 3 keys per LeafNode chkmap._root_node.set_maximum_size(30) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"v") chkmap.map((b"aac",), b"v") chkmap.map((b"abc",), b"v") chkmap.map((b"acd",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aa' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'v'\n" " 'aac' LeafNode\n" " ('aac',) 'v'\n" " 'ab' LeafNode\n" " ('abc',) 'v'\n" " 'ac' LeafNode\n" " ('acd',) 'v'\n", chkmap._dump_tree(), ) # Save everything to the map, and start over chkmap = CHKMap(store, chkmap._save()) # Mapping an 'aa' key loads the internal node, but should not map the # 'ab' and 'ac' nodes chkmap.map((b"aad",), b"v") self.assertIsInstance(chkmap._root_node._items[b"aa"], InternalNode) self.assertIsInstance(chkmap._root_node._items[b"ab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"ac"], tuple) # Unmapping 'acd' can notice that 'aa' is an InternalNode and not have # to map in 'ab' chkmap.unmap((b"acd",)) self.assertIsInstance(chkmap._root_node._items[b"aa"], InternalNode) self.assertIsInstance(chkmap._root_node._items[b"ab"], tuple) def test_unmap_without_fitting_doesnt_page_in(self): """Test unmap without fitting doesnt page in.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(20) chkmap.map((b"aaa",), b"v") chkmap.map((b"aab",), b"v") self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'v'\n" " 'aab' LeafNode\n" " ('aab',) 'v'\n", chkmap._dump_tree(), ) # Save everything to the map, and start over chkmap = CHKMap(store, chkmap._save()) chkmap.map((b"aac",), b"v") chkmap.map((b"aad",), b"v") chkmap.map((b"aae",), b"v") chkmap.map((b"aaf",), b"v") # At this 
point, the previous nodes should not be paged in, but the # newly added nodes would be self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aad"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aae"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aaf"], LeafNode) # Now unmapping one of the new nodes will use only the already-paged-in # nodes to determine that we don't need to do more. chkmap.unmap((b"aaf",)) self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aad"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aae"], LeafNode) def test_unmap_pages_in_if_necessary(self): """Test unmap pages in if necessary.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(30) chkmap.map((b"aaa",), b"val") chkmap.map((b"aab",), b"val") chkmap.map((b"aac",), b"val") self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'val'\n" " 'aab' LeafNode\n" " ('aab',) 'val'\n" " 'aac' LeafNode\n" " ('aac',) 'val'\n", chkmap._dump_tree(), ) root_key = chkmap._save() # Save everything to the map, and start over chkmap = CHKMap(store, root_key) chkmap.map((b"aad",), b"v") # At this point, the previous nodes should not be paged in, but the # newly added node would be self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aad"], LeafNode) # Unmapping the new node will check the existing nodes to see if they # would 
fit. # Clear the page cache so we ensure we have to read all the children chk_map.clear_cache() chkmap.unmap((b"aad",)) self.assertIsInstance(chkmap._root_node._items[b"aaa"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aab"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aac"], LeafNode) def test_unmap_pages_in_from_page_cache(self): """Test unmap pages in from page cache.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(30) chkmap.map((b"aaa",), b"val") chkmap.map((b"aab",), b"val") chkmap.map((b"aac",), b"val") root_key = chkmap._save() # Save everything to the map, and start over chkmap = CHKMap(store, root_key) chkmap.map((b"aad",), b"val") self.assertEqualDiff( "'' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'val'\n" " 'aab' LeafNode\n" " ('aab',) 'val'\n" " 'aac' LeafNode\n" " ('aac',) 'val'\n" " 'aad' LeafNode\n" " ('aad',) 'val'\n", chkmap._dump_tree(), ) # Save everything to the map, start over after _dump_tree chkmap = CHKMap(store, root_key) chkmap.map((b"aad",), b"v") # At this point, the previous nodes should not be paged in, but the # newly added node would be self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aad"], LeafNode) # Now clear the page cache, and only include 2 of the children in the # cache aab_key = chkmap._root_node._items[b"aab"] aab_bytes = chk_map._get_cache()[aab_key] aac_key = chkmap._root_node._items[b"aac"] aac_bytes = chk_map._get_cache()[aac_key] chk_map.clear_cache() chk_map._get_cache()[aab_key] = aab_bytes chk_map._get_cache()[aac_key] = aac_bytes # Unmapping the new node will check the nodes from the page cache # first, and not have to read in 'aaa' chkmap.unmap((b"aad",)) self.assertIsInstance(chkmap._root_node._items[b"aaa"], 
tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aac"], LeafNode) def test_unmap_uses_existing_items(self): """Test unmap uses existing items.""" store = self.get_chk_bytes() chkmap = CHKMap(store, None) # Should fit 2 keys per LeafNode chkmap._root_node.set_maximum_size(30) chkmap.map((b"aaa",), b"val") chkmap.map((b"aab",), b"val") chkmap.map((b"aac",), b"val") root_key = chkmap._save() # Save everything to the map, and start over chkmap = CHKMap(store, root_key) chkmap.map((b"aad",), b"val") chkmap.map((b"aae",), b"val") chkmap.map((b"aaf",), b"val") # At this point, the previous nodes should not be paged in, but the # newly added node would be self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aad"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aae"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aaf"], LeafNode) # Unmapping a new node will see the other nodes that are already in # memory, and not need to page in anything else chkmap.unmap((b"aad",)) self.assertIsInstance(chkmap._root_node._items[b"aaa"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aab"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aac"], tuple) self.assertIsInstance(chkmap._root_node._items[b"aae"], LeafNode) self.assertIsInstance(chkmap._root_node._items[b"aaf"], LeafNode) def test_iter_changes_empty_ab(self): """Test iter changes empty ab.""" # Asking for changes between an empty dict to a dict with keys returns # all the keys. 
basis = self._get_map({}, maximum_size=10) target = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, chk_bytes=basis._store, maximum_size=10, ) self.assertEqual( [((b"a",), None, b"content here"), ((b"b",), None, b"more content")], sorted(target.iter_changes(basis)), ) def test_iter_changes_ab_empty(self): """Test iter changes ab empty.""" # Asking for changes between a dict with keys to an empty dict returns # all the keys. basis = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, maximum_size=10 ) target = self._get_map({}, chk_bytes=basis._store, maximum_size=10) self.assertEqual( [((b"a",), b"content here", None), ((b"b",), b"more content", None)], sorted(target.iter_changes(basis)), ) def test_iter_changes_empty_empty_is_empty(self): """Test iter changes empty empty is empty.""" basis = self._get_map({}, maximum_size=10) target = self._get_map({}, chk_bytes=basis._store, maximum_size=10) self.assertEqual([], sorted(target.iter_changes(basis))) def test_iter_changes_ab_ab_is_empty(self): """Test iter changes ab ab is empty.""" basis = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, maximum_size=10 ) target = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, chk_bytes=basis._store, maximum_size=10, ) self.assertEqual([], sorted(target.iter_changes(basis))) def test_iter_changes_ab_ab_nodes_not_loaded(self): """Test iter changes ab ab nodes not loaded.""" basis = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, maximum_size=10 ) target = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, chk_bytes=basis._store, maximum_size=10, ) list(target.iter_changes(basis)) self.assertIsInstance(target._root_node, tuple) self.assertIsInstance(basis._root_node, tuple) def test_iter_changes_ab_ab_changed_values_shown(self): """Test iter changes ab ab changed values shown.""" basis = self._get_map( {(b"a",): b"content here", (b"b",): b"more content"}, 
maximum_size=10 ) target = self._get_map( {(b"a",): b"content here", (b"b",): b"different content"}, chk_bytes=basis._store, maximum_size=10, ) result = sorted(target.iter_changes(basis)) self.assertEqual([((b"b",), b"more content", b"different content")], result) def test_iter_changes_mixed_node_length(self): """Test iter changes mixed node length.""" # When one side has different node lengths than the other, common # but different keys still need to be shown, and new-and-old included # appropriately. # aaa - common unaltered # aab - common altered # b - basis only # at - target only # we expect: # aaa to be not loaded (later test) # aab, b, at to be returned. # basis splits at byte 0,1,2, aaa is common, b is basis only basis_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered a", (b"b",): b"foo bar b", } # target splits at byte 1,2, at is target only target_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered b", (b"at",): b"foo bar t", } changes = [ ((b"aab",), b"common altered a", b"common altered b"), ((b"at",), None, b"foo bar t"), ((b"b",), b"foo bar b", None), ] basis = self._get_map(basis_dict, maximum_size=10) target = self._get_map(target_dict, maximum_size=10, chk_bytes=basis._store) self.assertEqual(changes, sorted(target.iter_changes(basis))) def test_iter_changes_common_pages_not_loaded(self): """Test iter changes common pages not loaded.""" # aaa - common unaltered # aab - common altered # b - basis only # at - target only # we expect: # aaa to be not loaded # aaa not to be in result.
basis_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered a", (b"b",): b"foo bar b", } # target splits at byte 1, at is target only target_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered b", (b"at",): b"foo bar t", } basis = self._get_map(basis_dict, maximum_size=10) target = self._get_map(target_dict, maximum_size=10, chk_bytes=basis._store) basis_get = basis._store.get_record_stream def get_record_stream(keys, order, fulltext): if (b"sha1:1adf7c0d1b9140ab5f33bb64c6275fa78b1580b7",) in keys: raise AssertionError(f"'aaa' pointer was followed {keys!r}") return basis_get(keys, order, fulltext) basis._store.get_record_stream = get_record_stream result = sorted(target.iter_changes(basis)) for change in result: if change[0] == (b"aaa",): self.fail(f"Found unexpected change: {change}") def test_iter_changes_unchanged_keys_in_multi_key_leafs_ignored(self): """Test iter changes unchanged keys in multi key leafs ignored.""" # Within a leaf there are no hashes to exclude keys, make sure multi # value leaf nodes are handled well.
basis_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered a", (b"b",): b"foo bar b", } target_dict = { (b"aaa",): b"foo bar", (b"aab",): b"common altered b", (b"at",): b"foo bar t", } changes = [ ((b"aab",), b"common altered a", b"common altered b"), ((b"at",), None, b"foo bar t"), ((b"b",), b"foo bar b", None), ] basis = self._get_map(basis_dict) target = self._get_map(target_dict, chk_bytes=basis._store) self.assertEqual(changes, sorted(target.iter_changes(basis))) def test_iteritems_empty(self): """Test iteritems empty.""" chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict(chk_bytes, {}) chkmap = CHKMap(chk_bytes, root_key) self.assertEqual([], list(chkmap.iteritems())) def test_iteritems_two_items(self): """Test iteritems two items.""" chk_bytes = self.get_chk_bytes() root_key = CHKMap.from_dict( chk_bytes, {(b"a",): b"content here", (b"b",): b"more content"} ) chkmap = CHKMap(chk_bytes, root_key) self.assertEqual( [((b"a",), b"content here"), ((b"b",), b"more content")], sorted(chkmap.iteritems()), ) def test_iteritems_selected_one_of_two_items(self): """Test iteritems selected one of two items.""" chkmap = self._get_map({(b"a",): b"content here", (b"b",): b"more content"}) self.assertEqual({(b"a",): b"content here"}, self.to_dict(chkmap, [(b"a",)])) def test_iteritems_keys_prefixed_by_2_width_nodes(self): """Test iteritems keys prefixed by 2 width nodes.""" chkmap = self._get_map( { (b"a", b"a"): b"content here", ( b"a", b"b", ): b"more content", (b"b", b""): b"boring content", }, maximum_size=10, key_width=2, ) self.assertEqual( {(b"a", b"a"): b"content here", (b"a", b"b"): b"more content"}, self.to_dict(chkmap, [(b"a",)]), ) def test_iteritems_keys_prefixed_by_2_width_nodes_hashed(self): """Test iteritems keys prefixed by 2 width nodes hashed.""" search_key_func = chk_map.search_key_registry.get(b"hash-16-way") self.assertEqual(b"E8B7BE43\x00E8B7BE43", search_key_func((b"a", b"a"))) self.assertEqual(b"E8B7BE43\x0071BEEFF9", 
search_key_func((b"a", b"b"))) self.assertEqual(b"71BEEFF9\x0000000000", search_key_func((b"b", b""))) chkmap = self._get_map( { (b"a", b"a"): b"content here", ( b"a", b"b", ): b"more content", (b"b", b""): b"boring content", }, maximum_size=10, key_width=2, search_key_func=search_key_func, ) self.assertEqual( {(b"a", b"a"): b"content here", (b"a", b"b"): b"more content"}, self.to_dict(chkmap, [(b"a",)]), ) def test_iteritems_keys_prefixed_by_2_width_one_leaf(self): """Test iteritems keys prefixed by 2 width one leaf.""" chkmap = self._get_map( { (b"a", b"a"): b"content here", ( b"a", b"b", ): b"more content", (b"b", b""): b"boring content", }, key_width=2, ) self.assertEqual( {(b"a", b"a"): b"content here", (b"a", b"b"): b"more content"}, self.to_dict(chkmap, [(b"a",)]), ) def test___len__empty(self): """Test len empty.""" chkmap = self._get_map({}) self.assertEqual(0, len(chkmap)) def test___len__2(self): """Test len 2.""" chkmap = self._get_map({(b"foo",): b"bar", (b"gam",): b"quux"}) self.assertEqual(2, len(chkmap)) def test_max_size_100_bytes_new(self): """Test max size 100 bytes new.""" # When there is a 100 byte upper node limit, a tree is formed. chkmap = self._get_map( {(b"k1" * 50,): b"v1", (b"k2" * 50,): b"v2"}, maximum_size=100 ) # We expect three nodes: # A root, with two children, and with two key prefixes - k1 to one, and # k2 to the other as our node splitting is only just being developed. 
# The maximum size should be embedded chkmap._ensure_root() self.assertEqual(100, chkmap._root_node.maximum_size) self.assertEqual(1, chkmap._root_node._key_width) # There should be two child nodes, and prefix of 2(bytes): self.assertEqual(2, len(chkmap._root_node._items)) self.assertEqual(b"k", chkmap._root_node._compute_search_prefix()) # The actual nodes pointed at will change as serialisers change; so # here we test that the key prefix is correct; then load the nodes and # check they have the right pointed at key; whether they have the # pointed at value inline or not is also unrelated to this test so we # don't check that in detail - rather we just check the aggregate # value. nodes = sorted(chkmap._root_node._items.items()) ptr1 = nodes[0] ptr2 = nodes[1] self.assertEqual(b"k1", ptr1[0]) self.assertEqual(b"k2", ptr2[0]) node1 = chk_map._deserialise(chkmap._read_bytes(ptr1[1]), ptr1[1], None) self.assertIsInstance(node1, LeafNode) self.assertEqual(1, len(node1)) self.assertEqual({(b"k1" * 50,): b"v1"}, self.to_dict(node1, chkmap._store)) node2 = chk_map._deserialise(chkmap._read_bytes(ptr2[1]), ptr2[1], None) self.assertIsInstance(node2, LeafNode) self.assertEqual(1, len(node2)) self.assertEqual({(b"k2" * 50,): b"v2"}, self.to_dict(node2, chkmap._store)) # Having checked we have a good structure, check that the content is # still accessible. 
self.assertEqual(2, len(chkmap)) self.assertEqual( {(b"k1" * 50,): b"v1", (b"k2" * 50,): b"v2"}, self.to_dict(chkmap) ) def test_init_root_is_LeafNode_new(self): """Test init root is LeafNode new.""" chk_bytes = self.get_chk_bytes() chkmap = CHKMap(chk_bytes, None) self.assertIsInstance(chkmap._root_node, LeafNode) self.assertEqual({}, self.to_dict(chkmap)) self.assertEqual(0, len(chkmap)) def test_init_and_save_new(self): """Test init and save new.""" chk_bytes = self.get_chk_bytes() chkmap = CHKMap(chk_bytes, None) key = chkmap._save() leaf_node = LeafNode() self.assertEqual([key], leaf_node.serialise(chk_bytes)) def test_map_first_item_new(self): """Test map first item new.""" chk_bytes = self.get_chk_bytes() chkmap = CHKMap(chk_bytes, None) chkmap.map((b"foo,",), b"bar") self.assertEqual({(b"foo,",): b"bar"}, self.to_dict(chkmap)) self.assertEqual(1, len(chkmap)) key = chkmap._save() leaf_node = LeafNode() leaf_node.map(chk_bytes, (b"foo,",), b"bar") self.assertEqual([key], leaf_node.serialise(chk_bytes)) def test_unmap_last_item_root_is_leaf_new(self): """Test unmap last item root is leaf new.""" chkmap = self._get_map({(b"k1" * 50,): b"v1", (b"k2" * 50,): b"v2"}) chkmap.unmap((b"k1" * 50,)) chkmap.unmap((b"k2" * 50,)) self.assertEqual(0, len(chkmap)) self.assertEqual({}, self.to_dict(chkmap)) key = chkmap._save() leaf_node = LeafNode() self.assertEqual([key], leaf_node.serialise(chkmap._store)) def test__dump_tree(self): """Test dump tree.""" chkmap = self._get_map( { (b"aaa",): b"value1", (b"aab",): b"value2", (b"bbb",): b"value3", }, maximum_size=15, ) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'value1'\n" " 'aab' LeafNode\n" " ('aab',) 'value2'\n" " 'b' LeafNode\n" " ('bbb',) 'value3'\n", chkmap._dump_tree(), ) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'value1'\n" " 'aab' LeafNode\n" " ('aab',) 'value2'\n" " 'b' LeafNode\n" " ('bbb',) 'value3'\n", 
chkmap._dump_tree(), ) self.assertEqualDiff( "'' InternalNode sha1:0690d471eb0a624f359797d0ee4672bd68f4e236\n" " 'a' InternalNode sha1:1514c35503da9418d8fd90c1bed553077cb53673\n" " 'aaa' LeafNode sha1:4cc5970454d40b4ce297a7f13ddb76f63b88fefb\n" " ('aaa',) 'value1'\n" " 'aab' LeafNode sha1:1d68bc90914ef8a3edbcc8bb28b00cb4fea4b5e2\n" " ('aab',) 'value2'\n" " 'b' LeafNode sha1:3686831435b5596515353364eab0399dc45d49e7\n" " ('bbb',) 'value3'\n", chkmap._dump_tree(include_keys=True), ) def test__dump_tree_in_progress(self): """Test dump tree in progress.""" chkmap = self._get_map( {(b"aaa",): b"value1", (b"aab",): b"value2"}, maximum_size=10 ) chkmap.map((b"bbb",), b"value3") self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'value1'\n" " 'aab' LeafNode\n" " ('aab',) 'value2'\n" " 'b' LeafNode\n" " ('bbb',) 'value3'\n", chkmap._dump_tree(), ) # For things that are updated by adding 'bbb', we don't have a sha key # for them yet, so they are listed as None self.assertEqualDiff( "'' InternalNode None\n" " 'a' InternalNode sha1:6b0d881dd739a66f733c178b24da64395edfaafd\n" " 'aaa' LeafNode sha1:40b39a08d895babce17b20ae5f62d187eaa4f63a\n" " ('aaa',) 'value1'\n" " 'aab' LeafNode sha1:ad1dc7c4e801302c95bf1ba7b20bc45e548cd51a\n" " ('aab',) 'value2'\n" " 'b' LeafNode None\n" " ('bbb',) 'value3'\n", chkmap._dump_tree(include_keys=True), ) def _search_key_single(key): """A search key function that maps all nodes to the same value.""" return b"value" def _test_search_key(key): return b"test:" + b"\x00".join(key) class TestMapSearchKeys(TestCaseWithStore): """Tests for Map Search Keys.""" def test_default_chk_map_uses_flat_search_key(self): """Test default chk map uses flat search key.""" chkmap = chk_map.CHKMap(self.get_chk_bytes(), None) self.assertEqual(b"1", chkmap._search_key_func((b"1",))) self.assertEqual(b"1\x002", chkmap._search_key_func((b"1", b"2"))) self.assertEqual(b"1\x002\x003", chkmap._search_key_func((b"1", b"2", b"3"))) def 
test_search_key_is_passed_to_root_node(self): """Test search key is passed to root node.""" chkmap = chk_map.CHKMap( self.get_chk_bytes(), None, search_key_func=_test_search_key ) self.assertIs(_test_search_key, chkmap._search_key_func) self.assertEqual( b"test:1\x002\x003", chkmap._search_key_func((b"1", b"2", b"3")) ) self.assertEqual( b"test:1\x002\x003", chkmap._root_node._search_key((b"1", b"2", b"3")) ) def test_search_key_passed_via__ensure_root(self): """Test search key passed via ensure root.""" chk_bytes = self.get_chk_bytes() chkmap = chk_map.CHKMap(chk_bytes, None, search_key_func=_test_search_key) root_key = chkmap._save() chkmap = chk_map.CHKMap(chk_bytes, root_key, search_key_func=_test_search_key) chkmap._ensure_root() self.assertEqual( b"test:1\x002\x003", chkmap._root_node._search_key((b"1", b"2", b"3")) ) def test_search_key_with_internal_node(self): """Test search key with internal node.""" chk_bytes = self.get_chk_bytes() chkmap = chk_map.CHKMap(chk_bytes, None, search_key_func=_test_search_key) chkmap._root_node.set_maximum_size(10) chkmap.map((b"1",), b"foo") chkmap.map((b"2",), b"bar") chkmap.map((b"3",), b"baz") self.assertEqualDiff( "'' InternalNode\n" " 'test:1' LeafNode\n" " ('1',) 'foo'\n" " 'test:2' LeafNode\n" " ('2',) 'bar'\n" " 'test:3' LeafNode\n" " ('3',) 'baz'\n", chkmap._dump_tree(), ) root_key = chkmap._save() chkmap = chk_map.CHKMap(chk_bytes, root_key, search_key_func=_test_search_key) self.assertEqualDiff( "'' InternalNode\n" " 'test:1' LeafNode\n" " ('1',) 'foo'\n" " 'test:2' LeafNode\n" " ('2',) 'bar'\n" " 'test:3' LeafNode\n" " ('3',) 'baz'\n", chkmap._dump_tree(), ) def test_search_key_16(self): """Test search key 16.""" chk_bytes = self.get_chk_bytes() chkmap = chk_map.CHKMap(chk_bytes, None, search_key_func=chk_map._search_key_16) chkmap._root_node.set_maximum_size(10) chkmap.map((b"1",), b"foo") chkmap.map((b"2",), b"bar") chkmap.map((b"3",), b"baz") self.assertEqualDiff( "'' InternalNode\n" " '1' LeafNode\n" " ('2',) 
'bar'\n" " '6' LeafNode\n" " ('3',) 'baz'\n" " '8' LeafNode\n" " ('1',) 'foo'\n", chkmap._dump_tree(), ) root_key = chkmap._save() chkmap = chk_map.CHKMap( chk_bytes, root_key, search_key_func=chk_map._search_key_16 ) # We can get the values back correctly self.assertEqual([((b"1",), b"foo")], list(chkmap.iteritems([(b"1",)]))) self.assertEqualDiff( "'' InternalNode\n" " '1' LeafNode\n" " ('2',) 'bar'\n" " '6' LeafNode\n" " ('3',) 'baz'\n" " '8' LeafNode\n" " ('1',) 'foo'\n", chkmap._dump_tree(), ) def test_search_key_255(self): """Test search key 255.""" chk_bytes = self.get_chk_bytes() chkmap = chk_map.CHKMap( chk_bytes, None, search_key_func=chk_map._search_key_255 ) chkmap._root_node.set_maximum_size(10) chkmap.map((b"1",), b"foo") chkmap.map((b"2",), b"bar") chkmap.map((b"3",), b"baz") self.assertEqualDiff( "'' InternalNode\n" " '\\x1a' LeafNode\n" " ('2',) 'bar'\n" " 'm' LeafNode\n" " ('3',) 'baz'\n" " '\\x83' LeafNode\n" " ('1',) 'foo'\n", chkmap._dump_tree(encoding="latin1"), ) root_key = chkmap._save() chkmap = chk_map.CHKMap( chk_bytes, root_key, search_key_func=chk_map._search_key_255 ) # We can get the values back correctly self.assertEqual([((b"1",), b"foo")], list(chkmap.iteritems([(b"1",)]))) self.assertEqualDiff( "'' InternalNode\n" " '\\x1a' LeafNode\n" " ('2',) 'bar'\n" " 'm' LeafNode\n" " ('3',) 'baz'\n" " '\\x83' LeafNode\n" " ('1',) 'foo'\n", chkmap._dump_tree(encoding="latin1"), ) def test_search_key_collisions(self): """Test search key collisions.""" chkmap = chk_map.CHKMap( self.get_chk_bytes(), None, search_key_func=_search_key_single ) # The node will want to expand, but it cannot, because it knows that # all the keys must map to this node chkmap._root_node.set_maximum_size(20) chkmap.map((b"1",), b"foo") chkmap.map((b"2",), b"bar") chkmap.map((b"3",), b"baz") self.assertEqualDiff( "'' LeafNode\n ('1',) 'foo'\n ('2',) 'bar'\n ('3',) 'baz'\n", chkmap._dump_tree(), ) class TestLeafNode(TestCaseWithStore): """Tests for Leaf Node.""" def 
test_current_size_empty(self): """Test current size empty.""" node = LeafNode() self.assertEqual(16, node._current_size()) def test_current_size_size_changed(self): """Test current size size changed.""" node = LeafNode() node.set_maximum_size(10) self.assertEqual(17, node._current_size()) def test_current_size_width_changed(self): """Test current size width changed.""" node = LeafNode() node._key_width = 10 self.assertEqual(17, node._current_size()) def test_current_size_items(self): """Test current size items.""" node = LeafNode() base_size = node._current_size() node.map(None, (b"foo bar",), b"baz") self.assertEqual(base_size + 14, node._current_size()) def test_deserialise_empty(self): """Test deserialise empty.""" node = LeafNode.deserialise(b"chkleaf:\n10\n1\n0\n\n", (b"sha1:1234",)) self.assertEqual(0, len(node)) self.assertEqual(10, node.maximum_size) self.assertEqual((b"sha1:1234",), node.key()) self.assertIs(None, node._search_prefix) self.assertIs(None, node._common_serialised_prefix) def test_deserialise_items(self): """Test deserialise items.""" node = LeafNode.deserialise( b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo bar",), b"baz"), ((b"quux",), b"blarh")], sorted(node.iteritems(None)), ) def test_deserialise_item_with_null_width_1(self): """Test deserialise item with null width 1.""" node = LeafNode.deserialise( b"chkleaf:\n0\n1\n2\n\nfoo\x001\nbar\x00baz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo",), b"bar\x00baz"), ((b"quux",), b"blarh")], sorted(node.iteritems(None)), ) def test_deserialise_item_with_null_width_2(self): """Test deserialise item with null width 2.""" node = LeafNode.deserialise( b"chkleaf:\n0\n2\n2\n\nfoo\x001\x001\nbar\x00baz\nquux\x00\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo", b"1"), b"bar\x00baz"), ((b"quux", b""), b"blarh")], 
sorted(node.iteritems(None)), ) def test_iteritems_selected_one_of_two_items(self): """Test iteritems selected one of two items.""" node = LeafNode.deserialise( b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"quux",), b"blarh")], sorted(node.iteritems(None, [(b"quux",), (b"qaz",)])), ) def test_deserialise_item_with_common_prefix(self): """Test deserialise item with common prefix.""" node = LeafNode.deserialise( b"chkleaf:\n0\n2\n2\nfoo\x00\n1\x001\nbar\x00baz\n2\x001\nblarh\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [((b"foo", b"1"), b"bar\x00baz"), ((b"foo", b"2"), b"blarh")], sorted(node.iteritems(None)), ) self.assertIs(chk_map._unknown, node._search_prefix) self.assertEqual(b"foo\x00", node._common_serialised_prefix) def test_deserialise_multi_line(self): """Test deserialise multi line.""" node = LeafNode.deserialise( b"chkleaf:\n0\n2\n2\nfoo\x00\n1\x002\nbar\nbaz\n2\x002\nblarh\n\n", (b"sha1:1234",), ) self.assertEqual(2, len(node)) self.assertEqual( [ ((b"foo", b"1"), b"bar\nbaz"), ((b"foo", b"2"), b"blarh\n"), ], sorted(node.iteritems(None)), ) self.assertIs(chk_map._unknown, node._search_prefix) self.assertEqual(b"foo\x00", node._common_serialised_prefix) def test_key_new(self): """Test key new.""" node = LeafNode() self.assertEqual(None, node.key()) def test_key_after_map(self): """Test key after map.""" node = LeafNode.deserialise(b"chkleaf:\n10\n1\n0\n\n", (b"sha1:1234",)) node.map(None, (b"foo bar",), b"baz quux") self.assertEqual(None, node.key()) def test_key_after_unmap(self): """Test key after unmap.""" node = LeafNode.deserialise( b"chkleaf:\n0\n1\n2\n\nfoo bar\x001\nbaz\nquux\x001\nblarh\n", (b"sha1:1234",), ) node.unmap(None, (b"foo bar",)) self.assertEqual(None, node.key()) def test_map_exceeding_max_size_only_entry_new(self): """Test map exceeding max size only entry new.""" node = LeafNode() node.set_maximum_size(10) result = 
node.map(None, (b"foo bar",), b"baz quux") self.assertEqual((b"foo bar", [(b"", node)]), result) self.assertLess(10, node._current_size()) def test_map_exceeding_max_size_second_entry_early_difference_new(self): """Test map exceeding max size second entry early difference new.""" node = LeafNode() node.set_maximum_size(10) node.map(None, (b"foo bar",), b"baz quux") prefix, result = list(node.map(None, (b"blue",), b"red")) self.assertEqual(b"", prefix) self.assertEqual(2, len(result)) split_chars = {result[0][0], result[1][0]} self.assertEqual({b"f", b"b"}, split_chars) nodes = dict(result) node = nodes[b"f"] self.assertEqual({(b"foo bar",): b"baz quux"}, self.to_dict(node, None)) self.assertEqual(10, node.maximum_size) self.assertEqual(1, node._key_width) node = nodes[b"b"] self.assertEqual({(b"blue",): b"red"}, self.to_dict(node, None)) self.assertEqual(10, node.maximum_size) self.assertEqual(1, node._key_width) def test_map_first(self): """Test map first.""" node = LeafNode() result = node.map(None, (b"foo bar",), b"baz quux") self.assertEqual((b"foo bar", [(b"", node)]), result) self.assertEqual({(b"foo bar",): b"baz quux"}, self.to_dict(node, None)) self.assertEqual(1, len(node)) def test_map_second(self): """Test map second.""" node = LeafNode() node.map(None, (b"foo bar",), b"baz quux") result = node.map(None, (b"bingo",), b"bango") self.assertEqual((b"", [(b"", node)]), result) self.assertEqual( {(b"foo bar",): b"baz quux", (b"bingo",): b"bango"}, self.to_dict(node, None), ) self.assertEqual(2, len(node)) def test_map_replacement(self): """Test map replacement.""" node = LeafNode() node.map(None, (b"foo bar",), b"baz quux") result = node.map(None, (b"foo bar",), b"bango") self.assertEqual((b"foo bar", [(b"", node)]), result) self.assertEqual({(b"foo bar",): b"bango"}, self.to_dict(node, None)) self.assertEqual(1, len(node)) def test_serialise_empty(self): """Test serialise empty.""" store = self.get_chk_bytes() node = LeafNode() node.set_maximum_size(10) 
expected_key = (b"sha1:f34c3f0634ea3f85953dffa887620c0a5b1f4a51",) self.assertEqual([expected_key], list(node.serialise(store))) self.assertEqual( b"chkleaf:\n10\n1\n0\n\n", self.read_bytes(store, expected_key) ) self.assertEqual(expected_key, node.key()) def test_serialise_items(self): """Test serialise items.""" store = self.get_chk_bytes() node = LeafNode() node.set_maximum_size(10) node.map(None, (b"foo bar",), b"baz quux") expected_key = (b"sha1:f89fac7edfc6bdb1b1b54a556012ff0c646ef5e0",) self.assertEqual(b"foo bar", node._common_serialised_prefix) self.assertEqual([expected_key], list(node.serialise(store))) self.assertEqual( b"chkleaf:\n10\n1\n1\nfoo bar\n\x001\nbaz quux\n", self.read_bytes(store, expected_key), ) self.assertEqual(expected_key, node.key()) def test_unique_serialised_prefix_empty_new(self): """Test unique serialised prefix empty new.""" node = LeafNode() self.assertIs(None, node._compute_search_prefix()) def test_unique_serialised_prefix_one_item_new(self): """Test unique serialised prefix one item new.""" node = LeafNode() node.map(None, (b"foo bar", b"baz"), b"baz quux") self.assertEqual(b"foo bar\x00baz", node._compute_search_prefix()) def test_unmap_missing(self): """Test unmap missing.""" node = LeafNode() self.assertRaises(KeyError, node.unmap, None, (b"foo bar",)) def test_unmap_present(self): """Test unmap present.""" node = LeafNode() node.map(None, (b"foo bar",), b"baz quux") result = node.unmap(None, (b"foo bar",)) self.assertEqual(node, result) self.assertEqual({}, self.to_dict(node, None)) self.assertEqual(0, len(node)) def test_map_maintains_common_prefixes(self): """Test map maintains common prefixes.""" node = LeafNode() node._key_width = 2 node.map(None, (b"foo bar", b"baz"), b"baz quux") self.assertEqual(b"foo bar\x00baz", node._search_prefix) self.assertEqual(b"foo bar\x00baz", node._common_serialised_prefix) node.map(None, (b"foo bar", b"bing"), b"baz quux") self.assertEqual(b"foo bar\x00b", node._search_prefix) 
self.assertEqual(b"foo bar\x00b", node._common_serialised_prefix) node.map(None, (b"fool", b"baby"), b"baz quux") self.assertEqual(b"foo", node._search_prefix) self.assertEqual(b"foo", node._common_serialised_prefix) node.map(None, (b"foo bar", b"baz"), b"replaced") self.assertEqual(b"foo", node._search_prefix) self.assertEqual(b"foo", node._common_serialised_prefix) node.map(None, (b"very", b"different"), b"value") self.assertEqual(b"", node._search_prefix) self.assertEqual(b"", node._common_serialised_prefix) def test_unmap_maintains_common_prefixes(self): """Test unmap maintains common prefixes.""" node = LeafNode() node._key_width = 2 node.map(None, (b"foo bar", b"baz"), b"baz quux") node.map(None, (b"foo bar", b"bing"), b"baz quux") node.map(None, (b"fool", b"baby"), b"baz quux") node.map(None, (b"very", b"different"), b"value") self.assertEqual(b"", node._search_prefix) self.assertEqual(b"", node._common_serialised_prefix) node.unmap(None, (b"very", b"different")) self.assertEqual(b"foo", node._search_prefix) self.assertEqual(b"foo", node._common_serialised_prefix) node.unmap(None, (b"fool", b"baby")) self.assertEqual(b"foo bar\x00b", node._search_prefix) self.assertEqual(b"foo bar\x00b", node._common_serialised_prefix) node.unmap(None, (b"foo bar", b"baz")) self.assertEqual(b"foo bar\x00bing", node._search_prefix) self.assertEqual(b"foo bar\x00bing", node._common_serialised_prefix) node.unmap(None, (b"foo bar", b"bing")) self.assertEqual(None, node._search_prefix) self.assertEqual(None, node._common_serialised_prefix) class TestInternalNode(TestCaseWithStore): """Tests for Internal Node.""" def test_add_node_empty_new(self): """Test add node empty new.""" node = InternalNode(b"fo") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"bar") node.add_node(b"foo", child) # Note that node isn't strictly valid now as a tree (only one child), # but that's ok for this test. 
# The first child defines the node's width: self.assertEqual(3, node._node_width) # We should be able to iterate over the contents without doing IO. self.assertEqual({(b"foo",): b"bar"}, self.to_dict(node, None)) # The length should be known: self.assertEqual(1, len(node)) # serialising the node should serialise the child and the node. chk_bytes = self.get_chk_bytes() keys = list(node.serialise(chk_bytes)) child_key = child.serialise(chk_bytes)[0] self.assertEqual( [child_key, (b"sha1:cf67e9997d8228a907c1f5bfb25a8bd9cd916fac",)], keys ) # We should be able to access deserialised content. bytes = self.read_bytes(chk_bytes, keys[1]) node = chk_map._deserialise(bytes, keys[1], None) self.assertEqual(1, len(node)) self.assertEqual({(b"foo",): b"bar"}, self.to_dict(node, chk_bytes)) self.assertEqual(3, node._node_width) def test_add_node_resets_key_new(self): """Test add node resets key new.""" node = InternalNode(b"fo") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"bar") node.add_node(b"foo", child) chk_bytes = self.get_chk_bytes() keys = list(node.serialise(chk_bytes)) self.assertEqual(keys[1], node._key) node.add_node(b"fos", child) self.assertEqual(None, node._key) # def test_add_node_empty_oversized_one_ok_new(self): # def test_add_node_one_oversized_second_kept_minimum_fan(self): # def test_add_node_two_oversized_third_kept_minimum_fan(self): # def test_add_node_one_oversized_second_splits_errors(self): def test__iter_nodes_no_key_filter(self): """Test iter nodes no key filter.""" node = InternalNode(b"") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"bar") node.add_node(b"f", child) child = LeafNode() child.set_maximum_size(100) child.map(None, (b"bar",), b"baz") node.add_node(b"b", child) for _child, node_key_filter in node._iter_nodes(None, key_filter=None): self.assertEqual(None, node_key_filter) def test__iter_nodes_splits_key_filter(self): """Test iter nodes splits key filter.""" node = 
InternalNode(b"") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"bar") node.add_node(b"f", child) child = LeafNode() child.set_maximum_size(100) child.map(None, (b"bar",), b"baz") node.add_node(b"b", child) # foo and bar both match exactly one leaf node, but 'cat' should not # match any, and should not be placed in one. key_filter = ((b"foo",), (b"bar",), (b"cat",)) for _child, node_key_filter in node._iter_nodes(None, key_filter=key_filter): # each child could only match one key filter, so make sure it was # properly filtered self.assertEqual(1, len(node_key_filter)) def test__iter_nodes_with_multiple_matches(self): """Test iter nodes with multiple matches.""" node = InternalNode(b"") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"val") child.map(None, (b"fob",), b"val") node.add_node(b"f", child) child = LeafNode() child.set_maximum_size(100) child.map(None, (b"bar",), b"val") child.map(None, (b"baz",), b"val") node.add_node(b"b", child) # Note that 'ram' doesn't match anything, so it should be freely # ignored key_filter = ((b"foo",), (b"fob",), (b"bar",), (b"baz",), (b"ram",)) for _child, node_key_filter in node._iter_nodes(None, key_filter=key_filter): # each child could match two key filters, so make sure they were # both included. 
self.assertEqual(2, len(node_key_filter)) def make_fo_fa_node(self): """Make fo fa node.""" node = InternalNode(b"f") child = LeafNode() child.set_maximum_size(100) child.map(None, (b"foo",), b"val") child.map(None, (b"fob",), b"val") node.add_node(b"fo", child) child = LeafNode() child.set_maximum_size(100) child.map(None, (b"far",), b"val") child.map(None, (b"faz",), b"val") node.add_node(b"fa", child) return node def test__iter_nodes_single_entry(self): """Test iter nodes single entry.""" node = self.make_fo_fa_node() key_filter = [(b"foo",)] nodes = list(node._iter_nodes(None, key_filter=key_filter)) self.assertEqual(1, len(nodes)) self.assertEqual(key_filter, nodes[0][1]) def test__iter_nodes_single_entry_misses(self): """Test iter nodes single entry misses.""" node = self.make_fo_fa_node() key_filter = [(b"bar",)] nodes = list(node._iter_nodes(None, key_filter=key_filter)) self.assertEqual(0, len(nodes)) def test__iter_nodes_mixed_key_width(self): """Test iter nodes mixed key width.""" node = self.make_fo_fa_node() key_filter = [(b"foo", b"bar"), (b"foo",), (b"fo",), (b"b",)] nodes = list(node._iter_nodes(None, key_filter=key_filter)) self.assertEqual(1, len(nodes)) matches = key_filter[:] matches.remove((b"b",)) self.assertEqual(sorted(matches), sorted(nodes[0][1])) def test__iter_nodes_match_all(self): """Test iter nodes match all.""" node = self.make_fo_fa_node() key_filter = [(b"foo", b"bar"), (b"foo",), (b"fo",), (b"f",)] nodes = list(node._iter_nodes(None, key_filter=key_filter)) self.assertEqual(2, len(nodes)) def test__iter_nodes_fixed_widths_and_misses(self): """Test iter nodes fixed widths and misses.""" node = self.make_fo_fa_node() # foo and faa should both match one child, baz should miss key_filter = [(b"foo",), (b"faa",), (b"baz",)] nodes = list(node._iter_nodes(None, key_filter=key_filter)) self.assertEqual(2, len(nodes)) for _node, matches in nodes: self.assertEqual(1, len(matches)) def test_iteritems_empty_new(self): """Test iteritems empty 
new.""" node = InternalNode() self.assertEqual([], sorted(node.iteritems(None))) def test_iteritems_two_children(self): """Test iteritems two children.""" node = InternalNode() leaf1 = LeafNode() leaf1.map(None, (b"foo bar",), b"quux") leaf2 = LeafNode() leaf2.map(None, (b"strange",), b"beast") node.add_node(b"f", leaf1) node.add_node(b"s", leaf2) self.assertEqual( [((b"foo bar",), b"quux"), ((b"strange",), b"beast")], sorted(node.iteritems(None)), ) def test_iteritems_two_children_partial(self): """Test iteritems two children partial.""" node = InternalNode() leaf1 = LeafNode() leaf1.map(None, (b"foo bar",), b"quux") leaf2 = LeafNode() leaf2.map(None, (b"strange",), b"beast") node.add_node(b"f", leaf1) # This sets up a path that should not be followed - it will error if # the code tries to. node._items[b"f"] = None node.add_node(b"s", leaf2) self.assertEqual( [((b"strange",), b"beast")], sorted(node.iteritems(None, [(b"strange",), (b"weird",)])), ) def test_iteritems_two_children_with_hash(self): """Test iteritems two children with hash.""" search_key_func = chk_map.search_key_registry.get(b"hash-255-way") node = InternalNode(search_key_func=search_key_func) leaf1 = LeafNode(search_key_func=search_key_func) leaf1.map( None, (b"foo bar",), b"quux", ) leaf2 = LeafNode(search_key_func=search_key_func) leaf2.map( None, (b"strange",), b"beast", ) self.assertEqual( b"\xbeF\x014", search_key_func((b"foo bar",)), ) self.assertEqual( b"\x85\xfa\xf7K", search_key_func((b"strange",)), ) node.add_node(b"\xbe", leaf1) # This sets up a path that should not be followed - it will error if # the code tries to. 
node._items[b"\xbe"] = None node.add_node(b"\x85", leaf2) self.assertEqual( [((b"strange",), b"beast")], sorted( node.iteritems( None, [ (b"strange",), (b"weird",), ], ) ), ) def test_iteritems_partial_empty(self): """Test iteritems partial empty.""" node = InternalNode() self.assertEqual([], sorted(node.iteritems([(b"missing",)]))) def test_map_to_new_child_new(self): """Test map to new child new.""" chkmap = self._get_map({(b"k1",): b"foo", (b"k2",): b"bar"}, maximum_size=10) chkmap._ensure_root() node = chkmap._root_node # Ensure test validity: nothing paged in below the root. self.assertEqual( 2, len([value for value in node._items.values() if isinstance(value, tuple)]), ) # now, mapping to k3 should add a k3 leaf prefix, nodes = node.map(None, (b"k3",), b"quux") self.assertEqual(b"k", prefix) self.assertEqual([(b"", node)], nodes) # check new child details child = node._items[b"k3"] self.assertIsInstance(child, LeafNode) self.assertEqual(1, len(child)) self.assertEqual({(b"k3",): b"quux"}, self.to_dict(child, None)) self.assertEqual(None, child._key) self.assertEqual(10, child.maximum_size) self.assertEqual(1, child._key_width) # Check overall structure: self.assertEqual(3, len(chkmap)) self.assertEqual( {(b"k1",): b"foo", (b"k2",): b"bar", (b"k3",): b"quux"}, self.to_dict(chkmap), ) # serialising should only serialise the new data - k3 and the internal # node. 
keys = list(node.serialise(chkmap._store)) child_key = child.serialise(chkmap._store)[0] self.assertEqual([child_key, keys[1]], keys) def test_map_to_child_child_splits_new(self): """Test map to child child splits new.""" chkmap = self._get_map({(b"k1",): b"foo", (b"k22",): b"bar"}, maximum_size=10) # Check for the canonical root value for this tree: self.assertEqualDiff( "'' InternalNode\n" " 'k1' LeafNode\n" " ('k1',) 'foo'\n" " 'k2' LeafNode\n" " ('k22',) 'bar'\n", chkmap._dump_tree(), ) # _dump_tree pages everything in, so reload using just the root chkmap = CHKMap(chkmap._store, chkmap._root_node) chkmap._ensure_root() node = chkmap._root_node # Ensure test validity: nothing paged in below the root. self.assertEqual( 2, len([value for value in node._items.values() if isinstance(value, tuple)]), ) # now, mapping to k23 causes k22 ('k2' in node) to split into k22 and # k23, which for simplicity in the current implementation generates # a new internal node between node, and k22/k23. prefix, nodes = node.map(chkmap._store, (b"k23",), b"quux") self.assertEqual(b"k", prefix) self.assertEqual([(b"", node)], nodes) # check new child details child = node._items[b"k2"] self.assertIsInstance(child, InternalNode) self.assertEqual(2, len(child)) self.assertEqual( {(b"k22",): b"bar", (b"k23",): b"quux"}, self.to_dict(child, None) ) self.assertEqual(None, child._key) self.assertEqual(10, child.maximum_size) self.assertEqual(1, child._key_width) self.assertEqual(3, child._node_width) # Check overall structure: self.assertEqual(3, len(chkmap)) self.assertEqual( {(b"k1",): b"foo", (b"k22",): b"bar", (b"k23",): b"quux"}, self.to_dict(chkmap), ) # serialising should only serialise the new data - although k22 hasn't # changed because it's a special corner case (splitting with only one # key leaves one node unaltered), in general k22 is serialised, so we # expect k22, k23, the new internal node, and node, to be serialised. 
keys = list(node.serialise(chkmap._store)) child_key = child._key k22_key = child._items[b"k22"]._key k23_key = child._items[b"k23"]._key self.assertEqual({k22_key, k23_key, child_key, node.key()}, set(keys)) self.assertEqualDiff( "'' InternalNode\n" " 'k1' LeafNode\n" " ('k1',) 'foo'\n" " 'k2' InternalNode\n" " 'k22' LeafNode\n" " ('k22',) 'bar'\n" " 'k23' LeafNode\n" " ('k23',) 'quux'\n", chkmap._dump_tree(), ) def test__search_prefix_filter_with_hash(self): """Test search prefix filter with hash.""" search_key_func = chk_map.search_key_registry.get(b"hash-16-way") node = InternalNode(search_key_func=search_key_func) node._key_width = 2 node._node_width = 4 self.assertEqual(b"E8B7BE43\x0071BEEFF9", search_key_func((b"a", b"b"))) self.assertEqual(b"E8B7", node._search_prefix_filter((b"a", b"b"))) self.assertEqual( b"E8B7", node._search_prefix_filter((b"a",)), ) def test_unmap_k23_from_k1_k22_k23_gives_k1_k22_tree_new(self): """Test unmap k23 from k1 k22 k23 gives k1 k22 tree new.""" chkmap = self._get_map( {(b"k1",): b"foo", (b"k22",): b"bar", (b"k23",): b"quux"}, maximum_size=10 ) # Check we have the expected tree. self.assertEqualDiff( "'' InternalNode\n" " 'k1' LeafNode\n" " ('k1',) 'foo'\n" " 'k2' InternalNode\n" " 'k22' LeafNode\n" " ('k22',) 'bar'\n" " 'k23' LeafNode\n" " ('k23',) 'quux'\n", chkmap._dump_tree(), ) chkmap = CHKMap(chkmap._store, chkmap._root_node) chkmap._ensure_root() node = chkmap._root_node # unmapping k23 should give us a root, with k1 and k22 as direct # children. 
node.unmap(chkmap._store, (b"k23",)) # check the pointed-at object within node - k2 should now point at the # k22 leaf (which has been paged in to see if we can collapse the tree) child = node._items[b"k2"] self.assertIsInstance(child, LeafNode) self.assertEqual(1, len(child)) self.assertEqual({(b"k22",): b"bar"}, self.to_dict(child, None)) # Check overall structure is intact: self.assertEqual(2, len(chkmap)) self.assertEqual({(b"k1",): b"foo", (b"k22",): b"bar"}, self.to_dict(chkmap)) # serialising should only serialise the new data - the root node. keys = list(node.serialise(chkmap._store)) self.assertEqual([keys[-1]], keys) chkmap = CHKMap(chkmap._store, keys[-1]) self.assertEqualDiff( "'' InternalNode\n" " 'k1' LeafNode\n" " ('k1',) 'foo'\n" " 'k2' LeafNode\n" " ('k22',) 'bar'\n", chkmap._dump_tree(), ) def test_unmap_k1_from_k1_k22_k23_gives_k22_k23_tree_new(self): """Test unmap k1 from k1 k22 k23 gives k22 k23 tree new.""" chkmap = self._get_map( {(b"k1",): b"foo", (b"k22",): b"bar", (b"k23",): b"quux"}, maximum_size=10 ) self.assertEqualDiff( "'' InternalNode\n" " 'k1' LeafNode\n" " ('k1',) 'foo'\n" " 'k2' InternalNode\n" " 'k22' LeafNode\n" " ('k22',) 'bar'\n" " 'k23' LeafNode\n" " ('k23',) 'quux'\n", chkmap._dump_tree(), ) orig_root = chkmap._root_node chkmap = CHKMap(chkmap._store, orig_root) chkmap._ensure_root() node = chkmap._root_node k2_ptr = node._items[b"k2"] # unmapping k1 should give us a root, with k22 and k23 as direct # children, and should not have needed to page in the subtree. 
result = node.unmap(chkmap._store, (b"k1",)) self.assertEqual(k2_ptr, result) chkmap = CHKMap(chkmap._store, orig_root) # Unmapping at the CHKMap level should switch to the new root chkmap.unmap((b"k1",)) self.assertEqual(k2_ptr, chkmap._root_node) self.assertEqualDiff( "'' InternalNode\n" " 'k22' LeafNode\n" " ('k22',) 'bar'\n" " 'k23' LeafNode\n" " ('k23',) 'quux'\n", chkmap._dump_tree(), ) # leaf: # map -> fits - done # map -> doesn't fit - shrink from left till fits # key data to return: the common prefix, new nodes. # unmap -> how to tell if siblings can be combined. # combining leaf nodes means expanding the prefix to the left; so gather the size of # all the leaf nodes addressed by expanding the prefix by 1; if any adjacent node # is an internal node, we know that that is a dense subtree - can't combine. # otherwise as soon as the sum of serialised values exceeds the split threshold # we know we can't combine - stop. # unmap -> key return data - space in node, common prefix length? and key count # internal: # variable length prefixes? -> later start with fixed width to get something going # map -> fits - update pointer to leaf # return [prefix and node] - seems sound. # map -> doesn't fit - find unique prefix and shift right # create internal nodes for all the partitions, return list of unique # prefixes and nodes. # map -> new prefix - create a leaf # unmap -> if child key count 0, remove # unmap -> return space in node, common prefix length? (why?), key count # map: # map, if 1 node returned, use it, otherwise make an internal and populate. # map - unmap - if empty, use empty leafnode (avoids special cases in driver # code) # map inits as empty leafnode. 
# tools: # visualiser # how to handle: # AA, AB, AC, AD, BA # packed internal node - ideal: # AA, AB, AC, AD, BA # single byte fanout - A,B, AA,AB,AC,AD, BA # build orders: # BA # AB - split, but we want to end up with AB, BA, in one node, with # 1-4K get0 class TestCHKMapDifference(TestCaseWithExampleMaps): """Tests for CHKMap Difference.""" def get_difference(self, new_roots, old_roots, search_key_func=None): """Get difference.""" if search_key_func is None: search_key_func = chk_map._search_key_plain return chk_map.CHKMapDifference( self.get_chk_bytes(), new_roots, old_roots, search_key_func ) def test__init__(self): """Test init .""" c_map = self.make_root_only_map() key1 = c_map.key() c_map.map((b"aaa",), b"new aaa content") key2 = c_map._save() diff = self.get_difference([key2], [key1]) self.assertEqual({key1}, diff._all_old_chks) self.assertEqual([], diff._old_queue) self.assertEqual([], diff._new_queue) def help__read_all_roots(self, search_key_func): """Help read all roots.""" c_map = self.make_root_only_map(search_key_func=search_key_func) key1 = c_map.key() c_map.map((b"aaa",), b"new aaa content") key2 = c_map._save() diff = self.get_difference([key2], [key1], search_key_func) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key2], root_results) # We should have queued up only items that aren't in the old # set self.assertEqual([((b"aaa",), b"new aaa content")], diff._new_item_queue) self.assertEqual([], diff._new_queue) # And there are no old references, so that queue should be # empty self.assertEqual([], diff._old_queue) def test__read_all_roots_plain(self): """Test read all roots plain.""" self.help__read_all_roots(search_key_func=chk_map._search_key_plain) def test__read_all_roots_16(self): """Test read all roots 16.""" self.help__read_all_roots(search_key_func=chk_map._search_key_16) def test__read_all_roots_skips_known_old(self): """Test read all roots skips known old.""" c_map = 
self.make_one_deep_map(chk_map._search_key_plain) key1 = c_map.key() c_map2 = self.make_root_only_map(chk_map._search_key_plain) key2 = c_map2.key() diff = self.get_difference([key2], [key1], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] # We should have no results. key2 is completely contained within key1, # and we should have seen that in the first pass self.assertEqual([], root_results) def test__read_all_roots_prepares_queues(self): """Test read all roots prepares queues.""" c_map = self.make_one_deep_map(chk_map._search_key_plain) key1 = c_map.key() c_map._dump_tree() # load everything key1_a = c_map._root_node._items[b"a"].key() c_map.map((b"abb",), b"new abb content") key2 = c_map._save() key2_a = c_map._root_node._items[b"a"].key() diff = self.get_difference([key2], [key1], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key2], root_results) # At this point, we should have queued up only the 'a' Leaf on both # sides, both 'c' and 'd' are known to not have changed on both sides self.assertEqual([key2_a], diff._new_queue) self.assertEqual([], diff._new_item_queue) self.assertEqual([key1_a], diff._old_queue) def test__read_all_roots_multi_new_prepares_queues(self): """Test read all roots multi new prepares queues.""" c_map = self.make_one_deep_map(chk_map._search_key_plain) key1 = c_map.key() c_map._dump_tree() # load everything key1_a = c_map._root_node._items[b"a"].key() key1_c = c_map._root_node._items[b"c"].key() c_map.map((b"abb",), b"new abb content") key2 = c_map._save() key2_a = c_map._root_node._items[b"a"].key() c_map._root_node._items[b"c"].key() c_map = chk_map.CHKMap(self.get_chk_bytes(), key1, chk_map._search_key_plain) c_map.map((b"ccc",), b"new ccc content") key3 = c_map._save() c_map._root_node._items[b"a"].key() key3_c = c_map._root_node._items[b"c"].key() diff = self.get_difference([key2, key3], [key1], chk_map._search_key_plain) 
root_results = [record.key for record in diff._read_all_roots()] self.assertEqual(sorted([key2, key3]), sorted(root_results)) # We should have queued up key2_a, and key3_c, but not key2_c or key3_a self.assertEqual({key2_a, key3_c}, set(diff._new_queue)) self.assertEqual([], diff._new_item_queue) # And we should have queued up both a and c for the old set self.assertEqual({key1_a, key1_c}, set(diff._old_queue)) def test__read_all_roots_different_depths(self): """Test read all roots different depths.""" c_map = self.make_two_deep_map(chk_map._search_key_plain) c_map._dump_tree() # load everything key1 = c_map.key() key1_a = c_map._root_node._items[b"a"].key() key1_c = c_map._root_node._items[b"c"].key() key1_d = c_map._root_node._items[b"d"].key() c_map2 = self.make_one_deep_two_prefix_map(chk_map._search_key_plain) c_map2._dump_tree() key2 = c_map2.key() key2_aa = c_map2._root_node._items[b"aa"].key() key2_ad = c_map2._root_node._items[b"ad"].key() diff = self.get_difference([key2], [key1], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key2], root_results) # Only the 'a' subset should be queued up, since 'c' and 'd' cannot be # present self.assertEqual([key1_a], diff._old_queue) self.assertEqual({key2_aa, key2_ad}, set(diff._new_queue)) self.assertEqual([], diff._new_item_queue) diff = self.get_difference([key1], [key2], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key1], root_results) self.assertEqual({key2_aa, key2_ad}, set(diff._old_queue)) self.assertEqual({key1_a, key1_c, key1_d}, set(diff._new_queue)) self.assertEqual([], diff._new_item_queue) def test__read_all_roots_different_depths_16(self): """Test read all roots different depths 16.""" c_map = self.make_two_deep_map(chk_map._search_key_16) c_map._dump_tree() # load everything key1 = c_map.key() key1_2 = c_map._root_node._items[b"2"].key() key1_4 = 
c_map._root_node._items[b"4"].key() key1_C = c_map._root_node._items[b"C"].key() key1_F = c_map._root_node._items[b"F"].key() c_map2 = self.make_one_deep_two_prefix_map(chk_map._search_key_16) c_map2._dump_tree() key2 = c_map2.key() key2_F0 = c_map2._root_node._items[b"F0"].key() key2_F3 = c_map2._root_node._items[b"F3"].key() key2_F4 = c_map2._root_node._items[b"F4"].key() key2_FD = c_map2._root_node._items[b"FD"].key() diff = self.get_difference([key2], [key1], chk_map._search_key_16) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key2], root_results) # Only the subset of keys that may be present should be queued up. self.assertEqual([key1_F], diff._old_queue) self.assertEqual( sorted([key2_F0, key2_F3, key2_F4, key2_FD]), sorted(diff._new_queue) ) self.assertEqual([], diff._new_item_queue) diff = self.get_difference([key1], [key2], chk_map._search_key_16) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key1], root_results) self.assertEqual( sorted([key2_F0, key2_F3, key2_F4, key2_FD]), sorted(diff._old_queue) ) self.assertEqual( sorted([key1_2, key1_4, key1_C, key1_F]), sorted(diff._new_queue) ) self.assertEqual([], diff._new_item_queue) def test__read_all_roots_mixed_depth(self): """Test read all roots mixed depth.""" c_map = self.make_one_deep_two_prefix_map(chk_map._search_key_plain) c_map._dump_tree() # load everything key1 = c_map.key() key1_aa = c_map._root_node._items[b"aa"].key() c_map._root_node._items[b"ad"].key() c_map2 = self.make_one_deep_one_prefix_map(chk_map._search_key_plain) c_map2._dump_tree() key2 = c_map2.key() key2_a = c_map2._root_node._items[b"a"].key() key2_b = c_map2._root_node._items[b"b"].key() diff = self.get_difference([key2], [key1], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key2], root_results) # 'ad' matches exactly 'a' on the other side, so it should be removed, # and neither side should have 
        # it queued for walking
        self.assertEqual([], diff._old_queue)
        self.assertEqual([key2_b], diff._new_queue)
        self.assertEqual([], diff._new_item_queue)

        diff = self.get_difference([key1], [key2], chk_map._search_key_plain)
        root_results = [record.key for record in diff._read_all_roots()]
        self.assertEqual([key1], root_results)
        # Note: This is technically not the 'true minimal' set that we could
        #       use. The reason is that 'a' was matched exactly to 'ad' (by sha
        #       sum). However, the code gets complicated in the case of more
        #       than one interesting key, so for now, we live with this.
        #       Consider revising, though benchmarking should first show it to
        #       be a real-world issue.
        self.assertEqual([key2_a], diff._old_queue)
        # self.assertEqual([], diff._old_queue)
        self.assertEqual([key1_aa], diff._new_queue)
        self.assertEqual([], diff._new_item_queue)

    def test__read_all_roots_yields_extra_deep_records(self):
        """Test read all roots yields extra deep records."""
        # This is slightly controversial, as we will yield a chk page that we
        # might later on find out could be filtered out. (If a root node is
        # referenced deeper in the old set.)
        # However, even with stacking, we always have all chk pages that we
        # will need. So as long as we filter out the referenced keys, we'll
        # never run into problems.
        # This allows us to yield a root node record immediately, without any
        # buffering.
        c_map = self.make_two_deep_map(chk_map._search_key_plain)
        c_map._dump_tree()  # load all keys
        key1 = c_map.key()
        key1_a = c_map._root_node._items[b"a"].key()
        c_map2 = self.get_map(
            {
                (b"acc",): b"initial acc content",
                (b"ace",): b"initial ace content",
            },
            maximum_size=100,
        )
        self.assertEqualDiff(
            "'' LeafNode\n"
            " ('acc',) 'initial acc content'\n"
            " ('ace',) 'initial ace content'\n",
            c_map2._dump_tree(),
        )
        key2 = c_map2.key()
        diff = self.get_difference([key2], [key1], chk_map._search_key_plain)
        root_results = [record.key for record in diff._read_all_roots()]
        self.assertEqual([key2], root_results)
        # However, even though we have yielded the root node to be fetched,
        # we should have enqueued all of the chk pages to be walked, so that we
        # can find the keys if they are present
        self.assertEqual([key1_a], diff._old_queue)
        self.assertEqual(
            {
                ((b"acc",), b"initial acc content"),
                ((b"ace",), b"initial ace content"),
            },
            set(diff._new_item_queue),
        )

    def test__read_all_roots_multiple_targets(self):
        """Test read all roots multiple targets."""
        c_map = self.make_root_only_map()
        key1 = c_map.key()
        c_map = self.make_one_deep_map()
        key2 = c_map.key()
        c_map._dump_tree()
        key2_c = c_map._root_node._items[b"c"].key()
        key2_d = c_map._root_node._items[b"d"].key()
        c_map.map((b"ccc",), b"new ccc value")
        key3 = c_map._save()
        key3_c = c_map._root_node._items[b"c"].key()
        diff = self.get_difference([key2, key3], [key1], chk_map._search_key_plain)
        root_results = [record.key for record in diff._read_all_roots()]
        self.assertEqual(sorted([key2, key3]), sorted(root_results))
        self.assertEqual([], diff._old_queue)
        # the key 'd' is interesting from key2 and key3, but should only be
        # entered into the queue 1 time
        self.assertEqual(sorted([key2_c, key3_c, key2_d]), sorted(diff._new_queue))
        self.assertEqual([], diff._new_item_queue)

    def test__read_all_roots_no_old(self):
        """Test read all roots no old."""
        # This is the 'initial branch' case.
With nothing in the old # set, we can just queue up all root nodes into interesting queue, and # then have them fast-path flushed via _flush_new_queue c_map = self.make_two_deep_map() key1 = c_map.key() diff = self.get_difference([key1], [], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([], root_results) self.assertEqual([], diff._old_queue) self.assertEqual([key1], diff._new_queue) self.assertEqual([], diff._new_item_queue) c_map2 = self.make_one_deep_map() key2 = c_map2.key() diff = self.get_difference([key1, key2], [], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([], root_results) self.assertEqual([], diff._old_queue) self.assertEqual(sorted([key1, key2]), sorted(diff._new_queue)) self.assertEqual([], diff._new_item_queue) def test__read_all_roots_no_old_16(self): """Test read all roots no old 16.""" c_map = self.make_two_deep_map(chk_map._search_key_16) key1 = c_map.key() diff = self.get_difference([key1], [], chk_map._search_key_16) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([], root_results) self.assertEqual([], diff._old_queue) self.assertEqual([key1], diff._new_queue) self.assertEqual([], diff._new_item_queue) c_map2 = self.make_one_deep_map(chk_map._search_key_16) key2 = c_map2.key() diff = self.get_difference([key1, key2], [], chk_map._search_key_16) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([], root_results) self.assertEqual([], diff._old_queue) self.assertEqual(sorted([key1, key2]), sorted(diff._new_queue)) self.assertEqual([], diff._new_item_queue) def test__read_all_roots_multiple_old(self): """Test read all roots multiple old.""" c_map = self.make_two_deep_map() key1 = c_map.key() c_map._dump_tree() # load everything key1_a = c_map._root_node._items[b"a"].key() c_map.map((b"ccc",), b"new ccc value") key2 = c_map._save() 
c_map._root_node._items[b"a"].key() c_map.map((b"add",), b"new add value") key3 = c_map._save() key3_a = c_map._root_node._items[b"a"].key() diff = self.get_difference([key3], [key1, key2], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key3], root_results) # the 'a' keys should not be queued up 2 times, since they are # identical self.assertEqual([key1_a], diff._old_queue) self.assertEqual([key3_a], diff._new_queue) self.assertEqual([], diff._new_item_queue) def test__process_next_old_batched_no_dupes(self): """Test process next old batched no dupes.""" c_map = self.make_two_deep_map() key1 = c_map.key() c_map._dump_tree() # load everything key1_a = c_map._root_node._items[b"a"].key() key1_aa = c_map._root_node._items[b"a"]._items[b"aa"].key() key1_ab = c_map._root_node._items[b"a"]._items[b"ab"].key() key1_ac = c_map._root_node._items[b"a"]._items[b"ac"].key() key1_ad = c_map._root_node._items[b"a"]._items[b"ad"].key() c_map.map((b"aaa",), b"new aaa value") key2 = c_map._save() key2_a = c_map._root_node._items[b"a"].key() key2_aa = c_map._root_node._items[b"a"]._items[b"aa"].key() c_map.map((b"acc",), b"new acc content") key3 = c_map._save() key3_a = c_map._root_node._items[b"a"].key() c_map._root_node._items[b"a"]._items[b"ac"].key() diff = self.get_difference([key3], [key1, key2], chk_map._search_key_plain) root_results = [record.key for record in diff._read_all_roots()] self.assertEqual([key3], root_results) self.assertEqual(sorted([key1_a, key2_a]), sorted(diff._old_queue)) self.assertEqual([key3_a], diff._new_queue) self.assertEqual([], diff._new_item_queue) diff._process_next_old() # All of the old records should be brought in and queued up, # but we should not have any duplicates self.assertEqual( sorted([key1_aa, key1_ab, key1_ac, key1_ad, key2_aa]), sorted(diff._old_queue), ) class TestIterInterestingNodes(TestCaseWithExampleMaps): """Tests for Iter Interesting Nodes.""" def get_map_key(self, 
a_dict, maximum_size=10): """Get map key.""" c_map = self.get_map(a_dict, maximum_size=maximum_size) return c_map.key() def assertIterInteresting(self, records, items, interesting_keys, old_keys): """Check the result of iter_interesting_nodes. Note that we no longer care how many steps are taken, etc, just that the right contents are returned. :param records: A list of record keys that should be yielded :param items: A list of items (key,value) that should be yielded. """ store = self.get_chk_bytes() store._search_key_func = chk_map._search_key_plain iter_nodes = chk_map.iter_interesting_nodes(store, interesting_keys, old_keys) record_keys = [] all_items = [] for record, new_items in iter_nodes: if record is not None: record_keys.append(record.key) if new_items: all_items.extend(new_items) self.assertEqual(sorted(records), sorted(record_keys)) self.assertEqual(sorted(items), sorted(all_items)) def test_empty_to_one_keys(self): """Test empty to one keys.""" target = self.get_map_key({(b"a",): b"content"}) self.assertIterInteresting([target], [((b"a",), b"content")], [target], []) def test_none_to_one_key(self): """Test none to one key.""" basis = self.get_map_key({}) target = self.get_map_key({(b"a",): b"content"}) self.assertIterInteresting([target], [((b"a",), b"content")], [target], [basis]) def test_one_to_none_key(self): """Test one to none key.""" basis = self.get_map_key({(b"a",): b"content"}) target = self.get_map_key({}) self.assertIterInteresting([target], [], [target], [basis]) def test_common_pages(self): """Test common pages.""" basis = self.get_map_key( { (b"a",): b"content", (b"b",): b"content", (b"c",): b"content", } ) target = self.get_map_key( { (b"a",): b"content", (b"b",): b"other content", (b"c",): b"content", } ) target_map = CHKMap(self.get_chk_bytes(), target) self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('a',) 'content'\n" " 'b' LeafNode\n" " ('b',) 'other content'\n" " 'c' LeafNode\n" " ('c',) 'content'\n", 
target_map._dump_tree(), ) b_key = target_map._root_node._items[b"b"].key() # This should return the root node, and the node for the 'b' key self.assertIterInteresting( [target, b_key], [((b"b",), b"other content")], [target], [basis] ) def test_common_sub_page(self): """Test common sub page.""" basis = self.get_map_key( { (b"aaa",): b"common", (b"c",): b"common", } ) target = self.get_map_key( { (b"aaa",): b"common", (b"aab",): b"new", (b"c",): b"common", } ) target_map = CHKMap(self.get_chk_bytes(), target) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'common'\n" " 'aab' LeafNode\n" " ('aab',) 'new'\n" " 'c' LeafNode\n" " ('c',) 'common'\n", target_map._dump_tree(), ) # The key for the internal aa node a_key = target_map._root_node._items[b"a"].key() # The key for the leaf aab node # aaa_key = target_map._root_node._items['a']._items['aaa'].key() aab_key = target_map._root_node._items[b"a"]._items[b"aab"].key() self.assertIterInteresting( [target, a_key, aab_key], [((b"aab",), b"new")], [target], [basis] ) def test_common_leaf(self): """Test common leaf.""" basis = self.get_map_key({}) target1 = self.get_map_key({(b"aaa",): b"common"}) target2 = self.get_map_key( { (b"aaa",): b"common", (b"bbb",): b"new", } ) target3 = self.get_map_key( { (b"aaa",): b"common", (b"aac",): b"other", (b"bbb",): b"new", } ) # The LeafNode containing 'aaa': 'common' occurs at 3 different levels. # Once as a root node, once as a second layer, and once as a third # layer. 
It should only be returned one time regardless target1_map = CHKMap(self.get_chk_bytes(), target1) self.assertEqualDiff( "'' LeafNode\n ('aaa',) 'common'\n", target1_map._dump_tree() ) target2_map = CHKMap(self.get_chk_bytes(), target2) self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('aaa',) 'common'\n" " 'b' LeafNode\n" " ('bbb',) 'new'\n", target2_map._dump_tree(), ) target3_map = CHKMap(self.get_chk_bytes(), target3) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aaa' LeafNode\n" " ('aaa',) 'common'\n" " 'aac' LeafNode\n" " ('aac',) 'other'\n" " 'b' LeafNode\n" " ('bbb',) 'new'\n", target3_map._dump_tree(), ) target1_map._root_node.key() b_key = target2_map._root_node._items[b"b"].key() a_key = target3_map._root_node._items[b"a"].key() aac_key = target3_map._root_node._items[b"a"]._items[b"aac"].key() self.assertIterInteresting( [target1, target2, target3, a_key, aac_key, b_key], [((b"aaa",), b"common"), ((b"bbb",), b"new"), ((b"aac",), b"other")], [target1, target2, target3], [basis], ) self.assertIterInteresting( [target2, target3, a_key, aac_key, b_key], [((b"bbb",), b"new"), ((b"aac",), b"other")], [target2, target3], [target1], ) # Technically, target1 could be filtered out, but since it is a root # node, we yield it immediately, rather than waiting to find out much # later on. 
        self.assertIterInteresting([target1], [], [target1], [target3])

    def test_multiple_maps(self):
        """Test multiple maps."""
        basis1 = self.get_map_key(
            {
                (b"aaa",): b"common",
                (b"aab",): b"basis1",
            }
        )
        basis2 = self.get_map_key(
            {
                (b"bbb",): b"common",
                (b"bbc",): b"basis2",
            }
        )
        target1 = self.get_map_key(
            {
                (b"aaa",): b"common",
                (b"aac",): b"target1",
                (b"bbb",): b"common",
            }
        )
        target2 = self.get_map_key(
            {
                (b"aaa",): b"common",
                (b"bba",): b"target2",
                (b"bbb",): b"common",
            }
        )
        target1_map = CHKMap(self.get_chk_bytes(), target1)
        self.assertEqualDiff(
            "'' InternalNode\n"
            " 'a' InternalNode\n"
            " 'aaa' LeafNode\n"
            " ('aaa',) 'common'\n"
            " 'aac' LeafNode\n"
            " ('aac',) 'target1'\n"
            " 'b' LeafNode\n"
            " ('bbb',) 'common'\n",
            target1_map._dump_tree(),
        )
        # The key for the target1 internal a node
        a_key = target1_map._root_node._items[b"a"].key()
        # The key for the leaf aac node
        aac_key = target1_map._root_node._items[b"a"]._items[b"aac"].key()

        target2_map = CHKMap(self.get_chk_bytes(), target2)
        self.assertEqualDiff(
            "'' InternalNode\n"
            " 'a' LeafNode\n"
            " ('aaa',) 'common'\n"
            " 'b' InternalNode\n"
            " 'bba' LeafNode\n"
            " ('bba',) 'target2'\n"
            " 'bbb' LeafNode\n"
            " ('bbb',) 'common'\n",
            target2_map._dump_tree(),
        )
        # The key for the target2 internal bb node
        b_key = target2_map._root_node._items[b"b"].key()
        # The key for the leaf bba node
        bba_key = target2_map._root_node._items[b"b"]._items[b"bba"].key()
        self.assertIterInteresting(
            [target1, target2, a_key, aac_key, b_key, bba_key],
            [((b"aac",), b"target1"), ((b"bba",), b"target2")],
            [target1, target2],
            [basis1, basis2],
        )

    def test_multiple_maps_overlapping_common_new(self):
        """Test multiple maps overlapping common new."""
        # Test that when a node is found through the interesting_keys iteration
        # for *some roots* and also via the old keys iteration, it is still
        # scanned for old refs and items, because it's not truly new.
This requires 2 levels of InternalNodes to expose, # because of the way the bootstrap in _find_children_info works. # This suggests that the code is probably amenable to/benefit from # consolidation. # How does this test work? # 1) We need a second level InternalNode present in a basis tree. # 2) We need a left side new tree that uses that InternalNode # 3) We need a right side new tree that does not use that InternalNode # at all but that has an unchanged *value* that was reachable inside # that InternalNode basis = self.get_map_key( { # InternalNode, unchanged in left: (b"aaa",): b"left", (b"abb",): b"right", # Forces an internalNode at 'a' (b"ccc",): b"common", } ) left = self.get_map_key( { # All of basis unchanged (b"aaa",): b"left", (b"abb",): b"right", (b"ccc",): b"common", # And a new top level node so the root key is different (b"ddd",): b"change", } ) right = self.get_map_key( { # A value that is unchanged from basis and thus should be filtered # out. (b"abb",): b"right" } ) basis_map = CHKMap(self.get_chk_bytes(), basis) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aa' LeafNode\n" " ('aaa',) 'left'\n" " 'ab' LeafNode\n" " ('abb',) 'right'\n" " 'c' LeafNode\n" " ('ccc',) 'common'\n", basis_map._dump_tree(), ) # Get left expected data left_map = CHKMap(self.get_chk_bytes(), left) self.assertEqualDiff( "'' InternalNode\n" " 'a' InternalNode\n" " 'aa' LeafNode\n" " ('aaa',) 'left'\n" " 'ab' LeafNode\n" " ('abb',) 'right'\n" " 'c' LeafNode\n" " ('ccc',) 'common'\n" " 'd' LeafNode\n" " ('ddd',) 'change'\n", left_map._dump_tree(), ) # Keys from left side target l_d_key = left_map._root_node._items[b"d"].key() # Get right expected data right_map = CHKMap(self.get_chk_bytes(), right) self.assertEqualDiff( "'' LeafNode\n ('abb',) 'right'\n", right_map._dump_tree() ) # Keys from the right side target - none, the root is enough. 
# Test behaviour self.assertIterInteresting( [right, left, l_d_key], [((b"ddd",), b"change")], [left, right], [basis] ) def test_multiple_maps_similar(self): """Test multiple maps similar.""" # We want to have a depth=2 tree, with multiple entries in each leaf # node basis = self.get_map_key( { (b"aaa",): b"unchanged", (b"abb",): b"will change left", (b"caa",): b"unchanged", (b"cbb",): b"will change right", }, maximum_size=60, ) left = self.get_map_key( { (b"aaa",): b"unchanged", (b"abb",): b"changed left", (b"caa",): b"unchanged", (b"cbb",): b"will change right", }, maximum_size=60, ) right = self.get_map_key( { (b"aaa",): b"unchanged", (b"abb",): b"will change left", (b"caa",): b"unchanged", (b"cbb",): b"changed right", }, maximum_size=60, ) basis_map = CHKMap(self.get_chk_bytes(), basis) self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('aaa',) 'unchanged'\n" " ('abb',) 'will change left'\n" " 'c' LeafNode\n" " ('caa',) 'unchanged'\n" " ('cbb',) 'will change right'\n", basis_map._dump_tree(), ) # Get left expected data left_map = CHKMap(self.get_chk_bytes(), left) self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('aaa',) 'unchanged'\n" " ('abb',) 'changed left'\n" " 'c' LeafNode\n" " ('caa',) 'unchanged'\n" " ('cbb',) 'will change right'\n", left_map._dump_tree(), ) # Keys from left side target l_a_key = left_map._root_node._items[b"a"].key() left_map._root_node._items[b"c"].key() # Get right expected data right_map = CHKMap(self.get_chk_bytes(), right) self.assertEqualDiff( "'' InternalNode\n" " 'a' LeafNode\n" " ('aaa',) 'unchanged'\n" " ('abb',) 'will change left'\n" " 'c' LeafNode\n" " ('caa',) 'unchanged'\n" " ('cbb',) 'changed right'\n", right_map._dump_tree(), ) right_map._root_node._items[b"a"].key() r_c_key = right_map._root_node._items[b"c"].key() self.assertIterInteresting( [right, left, l_a_key, r_c_key], [((b"abb",), b"changed left"), ((b"cbb",), b"changed right")], [left, right], [basis], ) class TestSearchKeys(TestCase): 
"""Tests for Search Keys.""" def assertSearchKey16(self, expected, key): """Assert SearchKey16.""" self.assertEqual(expected, _search_key_16(key)) def assertSearchKey255(self, expected, key): """Assert SearchKey255.""" actual = _search_key_255(key) self.assertEqual(expected, actual, f"actual: {actual!r}") def test_simple_16(self): """Test simple 16.""" self.assertSearchKey16( b"8C736521", (b"foo",), ) self.assertSearchKey16(b"8C736521\x008C736521", (b"foo", b"foo")) self.assertSearchKey16(b"8C736521\x0076FF8CAA", (b"foo", b"bar")) self.assertSearchKey16( b"ED82CD11", (b"abcd",), ) def test_simple_255(self): """Test simple 255.""" self.assertSearchKey255( b"\x8cse!", (b"foo",), ) self.assertSearchKey255(b"\x8cse!\x00\x8cse!", (b"foo", b"foo")) self.assertSearchKey255(b"\x8cse!\x00v\xff\x8c\xaa", (b"foo", b"bar")) # The standard mapping for these would include '\n', so it should be # mapped to '_' self.assertSearchKey255(b"\xfdm\x93_\x00P_\x1bL", (b"<", b"V")) def test_255_does_not_include_newline(self): """Test 255 does not include newline.""" # When mapping via _search_key_255, we should never have the '\n' # character, but all other 255 values should be present chars_used = set() for char_in in range(256): search_key = _search_key_255((bytes([char_in]),)) chars_used.update([bytes([x]) for x in search_key]) all_chars = {bytes([x]) for x in range(256)} unused_chars = all_chars.symmetric_difference(chars_used) self.assertEqual({b"\n"}, unused_chars) class Test_BytesToTextKey(TestCase): """Tests for _bytes_to_text_key.""" def assertBytesToTextKey(self, key, bytes): """Assert BytesToTextKey.""" self.assertEqual(key, _bytes_to_text_key(bytes)) def assertBytesToTextKeyRaises(self, bytes): """Assert BytesToTextKeyRaises.""" # These are invalid bytes, and we want to make sure the code under test # raises an exception rather than segfaults, etc. We don't particularly # care what exception. 
self.assertRaises((ValueError, IndexError), _bytes_to_text_key, bytes) def test_file(self): """Test file.""" self.assertBytesToTextKey( (b"file-id", b"revision-id"), b"file: file-id\nparent-id\nname\nrevision-id\n" b"da39a3ee5e6b4b0d3255bfef95601890afd80709\n100\nN", ) def test_invalid_no_kind(self): """Test invalid no kind.""" self.assertBytesToTextKeyRaises( b"file file-id\nparent-id\nname\nrevision-id\n" b"da39a3ee5e6b4b0d3255bfef95601890afd80709\n100\nN" ) def test_invalid_no_space(self): """Test invalid no space.""" self.assertBytesToTextKeyRaises( b"file:file-id\nparent-id\nname\nrevision-id\n" b"da39a3ee5e6b4b0d3255bfef95601890afd80709\n100\nN" ) def test_invalid_too_short_file_id(self): """Test invalid too short file id.""" self.assertBytesToTextKeyRaises(b"file:file-id") def test_invalid_too_short_parent_id(self): """Test invalid too short parent id.""" self.assertBytesToTextKeyRaises(b"file:file-id\nparent-id") def test_invalid_too_short_name(self): """Test invalid too short name.""" self.assertBytesToTextKeyRaises(b"file:file-id\nparent-id\nname") def test_dir(self): """Test dir.""" self.assertBytesToTextKey( (b"dir-id", b"revision-id"), b"dir: dir-id\nparent-id\nname\nrevision-id" ) bzrformats_3.4.0.orig/bzrformats/tests/test_chk_serializer.py0000644000000000000000000001123615162115103021616 0ustar00# Copyright (C) 2009, 2010, 2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

from .._bzr_rs import revision_bencode_serializer
from ..revision import Revision
from . import TestCase

_working_revision_bencode1 = (
    b"l"
    b"l6:formati10ee"
    b"l9:committer54:Canonical.com Patch Queue Manager <pqm@pqm.ubuntu.com>e"
    b"l8:timezonei3600ee"
    b"l10:propertiesd11:branch-nick6:+trunkee"
    b"l9:timestamp14:1242300770.844e"
    b"l11:revision-id50:pqm@pqm.ubuntu.com-20090514113250-jntkkpminfn3e0tze"
    b"l10:parent-ids"
    b"l"
    b"50:pqm@pqm.ubuntu.com-20090514104039-kggemn7lrretzpvc"
    b"48:jelmer@samba.org-20090510012654-jp9ufxquekaokbeo"
    b"ee"
    b"l14:inventory-sha140:4a2c7fb50e077699242cf6eb16a61779c7b680a7e"
    b"l7:message35:(Jelmer) Move dpush to InterBranch.e"
    b"e"
)

_working_revision_bencode1_no_timezone = (
    b"l"
    b"l6:formati10ee"
    b"l9:committer54:Canonical.com Patch Queue Manager <pqm@pqm.ubuntu.com>e"
    b"l9:timestamp14:1242300770.844e"
    b"l10:propertiesd11:branch-nick6:+trunkee"
    b"l11:revision-id50:pqm@pqm.ubuntu.com-20090514113250-jntkkpminfn3e0tze"
    b"l10:parent-ids"
    b"l"
    b"50:pqm@pqm.ubuntu.com-20090514104039-kggemn7lrretzpvc"
    b"48:jelmer@samba.org-20090510012654-jp9ufxquekaokbeo"
    b"ee"
    b"l14:inventory-sha140:4a2c7fb50e077699242cf6eb16a61779c7b680a7e"
    b"l7:message35:(Jelmer) Move dpush to InterBranch.e"
    b"e"
)


class TestBEncodeSerializer1(TestCase):
    """Test BEncode serialization."""

    def test_unpack_revision(self):
        """Test unpacking a revision."""
        rev = revision_bencode_serializer.read_revision_from_string(
            _working_revision_bencode1
        )
        self.assertEqual(
            rev.committer, "Canonical.com Patch Queue Manager <pqm@pqm.ubuntu.com>"
        )
        self.assertEqual(
            rev.inventory_sha1, b"4a2c7fb50e077699242cf6eb16a61779c7b680a7"
        )
        self.assertEqual(
            [
                b"pqm@pqm.ubuntu.com-20090514104039-kggemn7lrretzpvc",
                b"jelmer@samba.org-20090510012654-jp9ufxquekaokbeo",
            ],
            rev.parent_ids,
        )
        self.assertEqual("(Jelmer) Move dpush to InterBranch.",
rev.message) self.assertEqual( b"pqm@pqm.ubuntu.com-20090514113250-jntkkpminfn3e0tz", rev.revision_id ) self.assertEqual({"branch-nick": "+trunk"}, rev.properties) self.assertEqual(3600, rev.timezone) def test_written_form_matches(self): rev = revision_bencode_serializer.read_revision_from_string( _working_revision_bencode1 ) as_str = revision_bencode_serializer.write_revision_to_string(rev) self.assertEqualDiff(_working_revision_bencode1, as_str) def test_unpack_revision_no_timezone(self): rev = revision_bencode_serializer.read_revision_from_string( _working_revision_bencode1_no_timezone ) self.assertEqual(None, rev.timezone) def assertRoundTrips(self, serializer, orig_rev): lines = serializer.write_revision_to_lines(orig_rev) new_rev = serializer.read_revision_from_string(b"".join(lines)) self.assertEqual(orig_rev, new_rev) def test_roundtrips_non_ascii(self): rev = Revision( b"revid1", message="\n\xe5me", committer="Erik B\xe5gfors", timestamp=1242385452, inventory_sha1=b"4a2c7fb50e077699242cf6eb16a61779c7b680a7", parent_ids=[], properties={}, timezone=3600, ) self.assertRoundTrips(revision_bencode_serializer, rev) def test_roundtrips_xml_invalid_chars(self): rev = Revision( b"revid1", properties={}, parent_ids=[], message="\t\ue000", committer="Erik B\xe5gfors", timestamp=1242385452, timezone=3600, inventory_sha1=b"4a2c7fb50e077699242cf6eb16a61779c7b680a7", ) self.assertRoundTrips(revision_bencode_serializer, rev) bzrformats_3.4.0.orig/bzrformats/tests/test_chunk_writer.py0000644000000000000000000001111215162115103021315 0ustar00# Copyright (C) 2008 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for writing fixed size chunks with compression.""" import zlib from .. import chunk_writer from . import TestCase class TestWriter(TestCase): def check_chunk(self, bytes_list, size): data = b"".join(bytes_list) self.assertEqual(size, len(data)) return zlib.decompress(data) def test_chunk_writer_empty(self): writer = chunk_writer.ChunkWriter(4096) bytes_list, unused, padding = writer.finish() node_bytes = self.check_chunk(bytes_list, 4096) self.assertEqual(b"", node_bytes) self.assertEqual(None, unused) # Only a zlib header. self.assertEqual(4088, padding) def test_optimize_for_speed(self): writer = chunk_writer.ChunkWriter(4096) writer.set_optimize(for_size=False) self.assertEqual( chunk_writer.ChunkWriter._repack_opts_for_speed, (writer._max_repack, writer._max_zsync), ) writer = chunk_writer.ChunkWriter(4096, optimize_for_size=False) self.assertEqual( chunk_writer.ChunkWriter._repack_opts_for_speed, (writer._max_repack, writer._max_zsync), ) def test_optimize_for_size(self): writer = chunk_writer.ChunkWriter(4096) writer.set_optimize(for_size=True) self.assertEqual( chunk_writer.ChunkWriter._repack_opts_for_size, (writer._max_repack, writer._max_zsync), ) writer = chunk_writer.ChunkWriter(4096, optimize_for_size=True) self.assertEqual( chunk_writer.ChunkWriter._repack_opts_for_size, (writer._max_repack, writer._max_zsync), ) def test_some_data(self): writer = chunk_writer.ChunkWriter(4096) writer.write(b"foo bar baz quux\n") bytes_list, unused, padding = writer.finish() node_bytes = 
self.check_chunk(bytes_list, 4096) self.assertEqual(b"foo bar baz quux\n", node_bytes) self.assertEqual(None, unused) # More than just the header.. self.assertEqual(4073, padding) @staticmethod def _make_lines(): lines = [] for group in range(48): offset = group * 50 numbers = list(range(offset, offset + 50)) # Create a line with this group lines.append(b"".join(b"%d" % n for n in numbers) + b"\n") return lines def test_too_much_data_does_not_exceed_size(self): # Generate enough data to exceed 4K lines = self._make_lines() writer = chunk_writer.ChunkWriter(4096) for idx, line in enumerate(lines): if writer.write(line): self.assertEqual(46, idx) break bytes_list, unused, _ = writer.finish() node_bytes = self.check_chunk(bytes_list, 4096) # the first 46 lines should have been added expected_bytes = b"".join(lines[:46]) self.assertEqualDiff(expected_bytes, node_bytes) # And the line that failed should have been saved for us self.assertEqual(lines[46], unused) def test_too_much_data_preserves_reserve_space(self): # Generate enough data to exceed 4K lines = self._make_lines() writer = chunk_writer.ChunkWriter(4096, 256) for idx, line in enumerate(lines): if writer.write(line): self.assertEqual(44, idx) break else: self.fail("We were able to write all lines") self.assertFalse(writer.write(b"A" * 256, reserved=True)) bytes_list, unused, _ = writer.finish() node_bytes = self.check_chunk(bytes_list, 4096) # the first 44 lines should have been added expected_bytes = b"".join(lines[:44]) + b"A" * 256 self.assertEqualDiff(expected_bytes, node_bytes) # And the line that failed should have been saved for us self.assertEqual(lines[44], unused) bzrformats_3.4.0.orig/bzrformats/tests/test_dirstate.py0000644000000000000000000012035015162115103020435 0ustar00# Copyright (C) 2006-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either 
version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests of the dirstate functionality being built for WorkingTreeFormat4.""" import binascii import bisect import os import struct from testscenarios import load_tests_apply_scenarios from bzrformats import osutils from .. import dirstate, inventory from . import TestCase, TestCaseInTempDir, dir_reader_scenarios # TODO: # TESTS to write: # general checks for NOT_IN_MEMORY error conditions. # set_path_id on a NOT_IN_MEMORY dirstate # set_path_id unicode support # set_path_id setting id of a path not root # set_path_id setting id when there are parents without the id in the parents # set_path_id setting id when there are parents with the id in the parents # set_path_id setting id when state is not in memory # set_path_id setting id when state is in memory unmodified # set_path_id setting id when state is in memory modified class TestErrors(TestCase): def test_dirstate_corrupt(self): error = dirstate.DirstateCorrupt( ".bzr/checkout/dirstate", 'trailing garbage: "x"' ) self.assertEqualDiff( "The dirstate file (.bzr/checkout/dirstate)" ' appears to be corrupt: trailing garbage: "x"', str(error), ) load_tests = load_tests_apply_scenarios class TestCaseWithDirState: """Helper methods for creating DirState objects. Inherit from this alongside a TestCase that provides a temp directory. 
""" scenarios = dir_reader_scenarios() # Set by load_tests _dir_reader_class = None _native_to_unicode = None # Not used yet def setUp(self): super().setUp() if self._dir_reader_class is None: self._dir_reader_class = osutils.UnicodeDirReader self.overrideAttr(osutils, "_selected_dir_reader", self._dir_reader_class()) def create_empty_dirstate(self): """Return a locked but empty dirstate.""" state = dirstate.DirState.initialize("dirstate") return state def create_dirstate_with_root(self): """Return a write-locked state with a single root entry.""" packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" root_entry_direntry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), ], ) dirblocks = [] dirblocks.append((b"", [root_entry_direntry])) dirblocks.append((b"", [])) state = self.create_empty_dirstate() try: state._set_data([], dirblocks) state._validate() except: state.unlock() raise return state def create_dirstate_with_root_and_subdir(self): """Return a locked DirState with a root and a subdir.""" packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" subdir_entry = ( (b"", b"subdir", b"subdir-id"), [ (b"d", b"", 0, False, packed_stat), ], ) state = self.create_dirstate_with_root() try: dirblocks = list(state._dirblocks) dirblocks[1][1].append(subdir_entry) state._set_data([], dirblocks) except: state.unlock() raise return state def create_complex_dirstate(self): r"""This dirstate contains multiple files and directories. / a-root-value a/ a-dir b/ b-dir c c-file d d-file a/e/ e-dir a/f f-file b/g g-file b/h\xc3\xa5 h-\xc3\xa5-file #This is u'\xe5' encoded into utf-8 Notice that a/e is an empty directory. :return: The dirstate, still write-locked. 
""" packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" null_sha = b"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" root_entry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), ], ) a_entry = ( (b"", b"a", b"a-dir"), [ (b"d", b"", 0, False, packed_stat), ], ) b_entry = ( (b"", b"b", b"b-dir"), [ (b"d", b"", 0, False, packed_stat), ], ) c_entry = ( (b"", b"c", b"c-file"), [ (b"f", null_sha, 10, False, packed_stat), ], ) d_entry = ( (b"", b"d", b"d-file"), [ (b"f", null_sha, 20, False, packed_stat), ], ) e_entry = ( (b"a", b"e", b"e-dir"), [ (b"d", b"", 0, False, packed_stat), ], ) f_entry = ( (b"a", b"f", b"f-file"), [ (b"f", null_sha, 30, False, packed_stat), ], ) g_entry = ( (b"b", b"g", b"g-file"), [ (b"f", null_sha, 30, False, packed_stat), ], ) h_entry = ( (b"b", b"h\xc3\xa5", b"h-\xc3\xa5-file"), [ (b"f", null_sha, 40, False, packed_stat), ], ) dirblocks = [] dirblocks.append((b"", [root_entry])) dirblocks.append((b"", [a_entry, b_entry, c_entry, d_entry])) dirblocks.append((b"a", [e_entry, f_entry])) dirblocks.append((b"b", [g_entry, h_entry])) state = dirstate.DirState.initialize("dirstate") state._validate() try: state._set_data([], dirblocks) except: state.unlock() raise return state def check_state_with_reopen(self, expected_result, state): """Check that state has current state expected_result. This will check the current state, open the file anew and check it again. This function expects the current state to be locked for writing, and will unlock it before re-opening. This is required because we can't open a lock_read() while something else has a lock_write(). write => mutually exclusive lock read => shared lock """ # The state should already be write locked, since we just had to do # some operation to get here. self.assertIsNotNone(state._lock_token) try: self.assertEqual(expected_result[0], state.get_parent_ids()) # there should be no ghosts in this tree. 
self.assertEqual([], state.get_ghosts()) # there should be one fileid in this tree - the root of the tree. self.assertEqual(expected_result[1], list(state._iter_entries())) state.save() finally: state.unlock() del state state = dirstate.DirState.on_file("dirstate") state.lock_read() try: self.assertEqual(expected_result[1], list(state._iter_entries())) finally: state.unlock() class TestDirStateInitialize(TestCaseWithDirState, TestCaseInTempDir): def test_initialize(self): expected_result = ( [], [ ( (b"", b"", b"TREE_ROOT"), # common details [ ( b"d", b"", 0, False, dirstate.DirState.NULLSTAT, ), # current tree ], ) ], ) state = dirstate.DirState.initialize("dirstate") try: self.assertIsInstance(state, dirstate.DirState) lines = state.get_lines() finally: state.unlock() # On win32 you can't read from a locked file, even within the same # process. So we have to unlock and release before we check the file # contents. self.assertFileEqual(b"".join(lines), "dirstate") state.lock_read() # check_state_with_reopen will unlock self.check_state_with_reopen(expected_result, state) class TestGetLines(TestCaseWithDirState, TestCaseInTempDir): def test_get_line_with_2_rows(self): state = self.create_dirstate_with_root_and_subdir() try: self.assertEqual( [ b"#bazaar dirstate flat format 3\n", b"crc32: 41262208\n", b"num_entries: 2\n", b"0\x00\n\x00" b"0\x00\n\x00" b"\x00\x00a-root-value\x00" b"d\x00\x000\x00n\x00AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk\x00\n\x00" b"\x00subdir\x00subdir-id\x00" b"d\x00\x000\x00n\x00AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk\x00\n\x00", ], state.get_lines(), ) finally: state.unlock() def test_entry_to_line(self): state = self.create_dirstate_with_root() try: self.assertEqual( b"\x00\x00a-root-value\x00d\x00\x000\x00n" b"\x00AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk", state._entry_to_line(state._dirblocks[0][1][0]), ) finally: state.unlock() def test_entry_to_line_with_parent(self): packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" root_entry = ( (b"", b"", b"a-root-value"), 
[ (b"d", b"", 0, False, packed_stat), # current tree details # first: a pointer to the current location (b"a", b"dirname/basename", 0, False, b""), ], ) state = dirstate.DirState.initialize("dirstate") try: self.assertEqual( b"\x00\x00a-root-value\x00" b"d\x00\x000\x00n\x00AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk\x00" b"a\x00dirname/basename\x000\x00n\x00", state._entry_to_line(root_entry), ) finally: state.unlock() def test_entry_to_line_with_two_parents_at_different_paths(self): # / in the tree, at / in one parent and /dirname/basename in the other. packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" root_entry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), # current tree details (b"d", b"", 0, False, b"rev_id"), # first parent details # second: a pointer to the current location (b"a", b"dirname/basename", 0, False, b""), ], ) state = dirstate.DirState.initialize("dirstate") try: self.assertEqual( b"\x00\x00a-root-value\x00" b"d\x00\x000\x00n\x00AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk\x00" b"d\x00\x000\x00n\x00rev_id\x00" b"a\x00dirname/basename\x000\x00n\x00", state._entry_to_line(root_entry), ) finally: state.unlock() def test_iter_entries(self): # we should be able to iterate the dirstate entries from end to end # this is for get_lines to be easy to read. 
packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" dirblocks = [] root_entries = [ ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), # current tree details ], ) ] dirblocks.append(("", root_entries)) # add two files in the root subdir_entry = ( (b"", b"subdir", b"subdir-id"), [ (b"d", b"", 0, False, packed_stat), # current tree details ], ) afile_entry = ( (b"", b"afile", b"afile-id"), [ (b"f", b"sha1value", 34, False, packed_stat), # current tree details ], ) dirblocks.append(("", [subdir_entry, afile_entry])) # and one in subdir file_entry2 = ( (b"subdir", b"2file", b"2file-id"), [ (b"f", b"sha1value", 23, False, packed_stat), # current tree details ], ) dirblocks.append(("subdir", [file_entry2])) state = dirstate.DirState.initialize("dirstate") try: state._set_data([], dirblocks) expected_entries = [root_entries[0], subdir_entry, afile_entry, file_entry2] self.assertEqual(expected_entries, list(state._iter_entries())) finally: state.unlock() class TestGetBlockRowIndex(TestCaseWithDirState, TestCaseInTempDir): def assertBlockRowIndexEqual( self, block_index, row_index, dir_present, file_present, state, dirname, basename, tree_index, ): self.assertEqual( (block_index, row_index, dir_present, file_present), state._get_block_entry_index(dirname, basename, tree_index), ) if dir_present: block = state._dirblocks[block_index] self.assertEqual(dirname, block[0]) if dir_present and file_present: row = state._dirblocks[block_index][1][row_index] self.assertEqual(dirname, row[0][0]) self.assertEqual(basename, row[0][1]) def test_simple_structure(self): state = self.create_dirstate_with_root_and_subdir() self.addCleanup(state.unlock) self.assertBlockRowIndexEqual(1, 0, True, True, state, b"", b"subdir", 0) self.assertBlockRowIndexEqual(1, 0, True, False, state, b"", b"bdir", 0) self.assertBlockRowIndexEqual(1, 1, True, False, state, b"", b"zdir", 0) self.assertBlockRowIndexEqual(2, 0, False, False, state, b"a", b"foo", 0) self.assertBlockRowIndexEqual(2, 
0, False, False, state, b"subdir", b"foo", 0) def test_complex_structure_exists(self): state = self.create_complex_dirstate() self.addCleanup(state.unlock) # Make sure we can find everything that exists self.assertBlockRowIndexEqual(0, 0, True, True, state, b"", b"", 0) self.assertBlockRowIndexEqual(1, 0, True, True, state, b"", b"a", 0) self.assertBlockRowIndexEqual(1, 1, True, True, state, b"", b"b", 0) self.assertBlockRowIndexEqual(1, 2, True, True, state, b"", b"c", 0) self.assertBlockRowIndexEqual(1, 3, True, True, state, b"", b"d", 0) self.assertBlockRowIndexEqual(2, 0, True, True, state, b"a", b"e", 0) self.assertBlockRowIndexEqual(2, 1, True, True, state, b"a", b"f", 0) self.assertBlockRowIndexEqual(3, 0, True, True, state, b"b", b"g", 0) self.assertBlockRowIndexEqual(3, 1, True, True, state, b"b", b"h\xc3\xa5", 0) def test_complex_structure_missing(self): state = self.create_complex_dirstate() self.addCleanup(state.unlock) # Make sure things would be inserted in the right locations # '_' comes before 'a' self.assertBlockRowIndexEqual(0, 0, True, True, state, b"", b"", 0) self.assertBlockRowIndexEqual(1, 0, True, False, state, b"", b"_", 0) self.assertBlockRowIndexEqual(1, 1, True, False, state, b"", b"aa", 0) self.assertBlockRowIndexEqual(1, 4, True, False, state, b"", b"h\xc3\xa5", 0) self.assertBlockRowIndexEqual(2, 0, False, False, state, b"_", b"a", 0) self.assertBlockRowIndexEqual(3, 0, False, False, state, b"aa", b"a", 0) self.assertBlockRowIndexEqual(4, 0, False, False, state, b"bb", b"a", 0) # This would be inserted between a/ and b/ self.assertBlockRowIndexEqual(3, 0, False, False, state, b"a/e", b"a", 0) # Put at the end self.assertBlockRowIndexEqual(4, 0, False, False, state, b"e", b"a", 0) class TestGetEntry(TestCaseWithDirState, TestCaseInTempDir): def assertEntryEqual(self, dirname, basename, file_id, state, path, index): """Check that the right entry is returned for a request to getEntry.""" entry = state._get_entry(index, path_utf8=path) if 
file_id is None: self.assertEqual((None, None), entry) else: cur = entry[0] self.assertEqual((dirname, basename, file_id), cur[:3]) def test_simple_structure(self): state = self.create_dirstate_with_root_and_subdir() self.addCleanup(state.unlock) self.assertEntryEqual(b"", b"", b"a-root-value", state, b"", 0) self.assertEntryEqual(b"", b"subdir", b"subdir-id", state, b"subdir", 0) self.assertEntryEqual(None, None, None, state, b"missing", 0) self.assertEntryEqual(None, None, None, state, b"missing/foo", 0) self.assertEntryEqual(None, None, None, state, b"subdir/foo", 0) def test_complex_structure_exists(self): state = self.create_complex_dirstate() self.addCleanup(state.unlock) self.assertEntryEqual(b"", b"", b"a-root-value", state, b"", 0) self.assertEntryEqual(b"", b"a", b"a-dir", state, b"a", 0) self.assertEntryEqual(b"", b"b", b"b-dir", state, b"b", 0) self.assertEntryEqual(b"", b"c", b"c-file", state, b"c", 0) self.assertEntryEqual(b"", b"d", b"d-file", state, b"d", 0) self.assertEntryEqual(b"a", b"e", b"e-dir", state, b"a/e", 0) self.assertEntryEqual(b"a", b"f", b"f-file", state, b"a/f", 0) self.assertEntryEqual(b"b", b"g", b"g-file", state, b"b/g", 0) self.assertEntryEqual( b"b", b"h\xc3\xa5", b"h-\xc3\xa5-file", state, b"b/h\xc3\xa5", 0 ) def test_complex_structure_missing(self): state = self.create_complex_dirstate() self.addCleanup(state.unlock) self.assertEntryEqual(None, None, None, state, b"_", 0) self.assertEntryEqual(None, None, None, state, b"_\xc3\xa5", 0) self.assertEntryEqual(None, None, None, state, b"a/b", 0) self.assertEntryEqual(None, None, None, state, b"c/d", 0) def test_get_entry_uninitialized(self): """Calling get_entry will load data if it needs to.""" state = self.create_dirstate_with_root() try: state.save() finally: state.unlock() del state state = dirstate.DirState.on_file("dirstate") state.lock_read() try: self.assertEqual(dirstate.DirState.NOT_IN_MEMORY, state._header_state) self.assertEqual(dirstate.DirState.NOT_IN_MEMORY, 
state._dirblock_state) self.assertEntryEqual(b"", b"", b"a-root-value", state, b"", 0) finally: state.unlock() class TestIterChildEntries(TestCaseWithDirState, TestCaseInTempDir): def create_dirstate_with_two_trees(self): r"""This dirstate contains multiple files and directories. / a-root-value a/ a-dir b/ b-dir c c-file d d-file a/e/ e-dir a/f f-file b/g g-file b/h\xc3\xa5 h-\xc3\xa5-file #This is u'\xe5' encoded into utf-8 Notice that a/e is an empty directory. There is one parent tree, which has the same shape with the following variations: b/g in the parent is gone. b/h in the parent has a different id b/i is new in the parent c is renamed to b/j in the parent :return: The dirstate, still write-locked. """ packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" null_sha = b"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" NULL_PARENT_DETAILS = dirstate.DirState.NULL_PARENT_DETAILS root_entry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, b"parent-revid"), ], ) a_entry = ( (b"", b"a", b"a-dir"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, b"parent-revid"), ], ) b_entry = ( (b"", b"b", b"b-dir"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, b"parent-revid"), ], ) c_entry = ( (b"", b"c", b"c-file"), [ (b"f", null_sha, 10, False, packed_stat), (b"r", b"b/j", 0, False, b""), ], ) d_entry = ( (b"", b"d", b"d-file"), [ (b"f", null_sha, 20, False, packed_stat), (b"f", b"d", 20, False, b"parent-revid"), ], ) e_entry = ( (b"a", b"e", b"e-dir"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, b"parent-revid"), ], ) f_entry = ( (b"a", b"f", b"f-file"), [ (b"f", null_sha, 30, False, packed_stat), (b"f", b"f", 20, False, b"parent-revid"), ], ) g_entry = ( (b"b", b"g", b"g-file"), [ (b"f", null_sha, 30, False, packed_stat), NULL_PARENT_DETAILS, ], ) h_entry1 = ( (b"b", b"h\xc3\xa5", b"h-\xc3\xa5-file1"), [ (b"f", null_sha, 40, False, packed_stat), NULL_PARENT_DETAILS, ], ) h_entry2 = ( (b"b", b"h\xc3\xa5", 
b"h-\xc3\xa5-file2"), [ NULL_PARENT_DETAILS, (b"f", b"h", 20, False, b"parent-revid"), ], ) i_entry = ( (b"b", b"i", b"i-file"), [ NULL_PARENT_DETAILS, (b"f", b"h", 20, False, b"parent-revid"), ], ) j_entry = ( (b"b", b"j", b"c-file"), [ (b"r", b"c", 0, False, b""), (b"f", b"j", 20, False, b"parent-revid"), ], ) dirblocks = [] dirblocks.append((b"", [root_entry])) dirblocks.append((b"", [a_entry, b_entry, c_entry, d_entry])) dirblocks.append((b"a", [e_entry, f_entry])) dirblocks.append((b"b", [g_entry, h_entry1, h_entry2, i_entry, j_entry])) state = dirstate.DirState.initialize("dirstate") state._validate() try: state._set_data([b"parent"], dirblocks) except: state.unlock() raise return state, dirblocks def test_iter_children_b(self): state, dirblocks = self.create_dirstate_with_two_trees() self.addCleanup(state.unlock) expected_result = [] expected_result.append(dirblocks[3][1][2]) # h2 expected_result.append(dirblocks[3][1][3]) # i expected_result.append(dirblocks[3][1][4]) # j self.assertEqual(expected_result, list(state._iter_child_entries(1, b"b"))) def test_iter_child_root(self): state, dirblocks = self.create_dirstate_with_two_trees() self.addCleanup(state.unlock) expected_result = [] expected_result.append(dirblocks[1][1][0]) # a expected_result.append(dirblocks[1][1][1]) # b expected_result.append(dirblocks[1][1][3]) # d expected_result.append(dirblocks[2][1][0]) # e expected_result.append(dirblocks[2][1][1]) # f expected_result.append(dirblocks[3][1][2]) # h2 expected_result.append(dirblocks[3][1][3]) # i expected_result.append(dirblocks[3][1][4]) # j self.assertEqual(expected_result, list(state._iter_child_entries(1, b""))) class InstrumentedDirState(dirstate.DirState): """A DirState with instrumented sha1 functionality.""" def __init__( self, path, sha1_provider, worth_saving_limit=0, use_filesystem_for_exec=True, fdatasync=False, ): super().__init__( path, sha1_provider, worth_saving_limit=worth_saving_limit, 
use_filesystem_for_exec=use_filesystem_for_exec, fdatasync=fdatasync, ) self._time_offset = 0 self._log = [] # member is dynamically set in DirState.__init__ to turn on trace self._sha1_provider = sha1_provider self._sha1_file = self._sha1_file_and_log def _sha_cutoff_time(self): timestamp = super()._sha_cutoff_time() self._cutoff_time = timestamp + self._time_offset def _sha1_file_and_log(self, abspath): self._log.append(("sha1", abspath)) return self._sha1_provider.sha1(abspath) def _read_link(self, abspath, old_link): self._log.append(("read_link", abspath, old_link)) return super()._read_link(abspath, old_link) def _lstat(self, abspath, entry): self._log.append(("lstat", abspath)) return super()._lstat(abspath, entry) def _is_executable(self, mode, old_executable): self._log.append(("is_exec", mode, old_executable)) return super()._is_executable(mode, old_executable) def adjust_time(self, secs): """Move the clock forward or back. :param secs: The amount to adjust the clock by. Positive values make it seem as if we are in the future, negative values make it seem like we are in the past. 
""" self._time_offset += secs self._cutoff_time = None class _FakeStat: """A class with the same attributes as a real stat result.""" def __init__(self, size, mtime, ctime, dev, ino, mode): self.st_size = size self.st_mtime = mtime self.st_ctime = ctime self.st_dev = dev self.st_ino = ino self.st_mode = mode @staticmethod def from_stat(st): return _FakeStat( st.st_size, st.st_mtime, st.st_ctime, st.st_dev, st.st_ino, st.st_mode ) class TestDiscardMergeParents(TestCaseWithDirState, TestCaseInTempDir): def test_discard_no_parents(self): # This should be a no-op state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._discard_merge_parents() state._validate() def test_discard_one_parent(self): # No-op packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" root_entry_direntry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, packed_stat), ], ) dirblocks = [] dirblocks.append((b"", [root_entry_direntry])) dirblocks.append((b"", [])) state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._set_data([b"parent-id"], dirblocks[:]) state._validate() state._discard_merge_parents() state._validate() self.assertEqual(dirblocks, state._dirblocks) def test_discard_simple(self): # No-op packed_stat = b"AAAAREUHaIpFB2iKAAADAQAtkqUAAIGk" root_entry_direntry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, packed_stat), ], ) expected_root_entry_direntry = ( (b"", b"", b"a-root-value"), [ (b"d", b"", 0, False, packed_stat), (b"d", b"", 0, False, packed_stat), ], ) dirblocks = [] dirblocks.append((b"", [root_entry_direntry])) dirblocks.append((b"", [])) state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._set_data([b"parent-id", b"merged-id"], dirblocks[:]) state._validate() # This should strip of the extra column state._discard_merge_parents() state._validate() expected_dirblocks = [(b"", 
[expected_root_entry_direntry]), (b"", [])] self.assertEqual(expected_dirblocks, state._dirblocks) def test_discard_absent(self): """If entries are only in a merge, discard should remove the entries.""" null_stat = dirstate.DirState.NULLSTAT present_dir = (b"d", b"", 0, False, null_stat) present_file = (b"f", b"", 0, False, null_stat) absent = dirstate.DirState.NULL_PARENT_DETAILS root_key = (b"", b"", b"a-root-value") file_in_root_key = (b"", b"file-in-root", b"a-file-id") file_in_merged_key = (b"", b"file-in-merged", b"b-file-id") dirblocks = [ (b"", [(root_key, [present_dir, present_dir, present_dir])]), ( b"", [ (file_in_merged_key, [absent, absent, present_file]), (file_in_root_key, [present_file, present_file, present_file]), ], ), ] state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._set_data([b"parent-id", b"merged-id"], dirblocks[:]) state._validate() exp_dirblocks = [ (b"", [(root_key, [present_dir, present_dir])]), ( b"", [ (file_in_root_key, [present_file, present_file]), ], ), ] state._discard_merge_parents() state._validate() self.assertEqual(exp_dirblocks, state._dirblocks) def test_discard_renamed(self): null_stat = dirstate.DirState.NULLSTAT present_dir = (b"d", b"", 0, False, null_stat) present_file = (b"f", b"", 0, False, null_stat) absent = dirstate.DirState.NULL_PARENT_DETAILS root_key = (b"", b"", b"a-root-value") file_in_root_key = (b"", b"file-in-root", b"a-file-id") # Renamed relative to parent file_rename_s_key = (b"", b"file-s", b"b-file-id") file_rename_t_key = (b"", b"file-t", b"b-file-id") # And one that is renamed between the parents, but absent in this key_in_1 = (b"", b"file-in-1", b"c-file-id") key_in_2 = (b"", b"file-in-2", b"c-file-id") dirblocks = [ (b"", [(root_key, [present_dir, present_dir, present_dir])]), ( b"", [ ( key_in_1, [absent, present_file, (b"r", b"file-in-2", b"c-file-id")], ), ( key_in_2, [absent, (b"r", b"file-in-1", b"c-file-id"), present_file], ), (file_in_root_key, [present_file, 
present_file, present_file]), ( file_rename_s_key, [(b"r", b"file-t", b"b-file-id"), absent, present_file], ), ( file_rename_t_key, [present_file, absent, (b"r", b"file-s", b"b-file-id")], ), ], ), ] exp_dirblocks = [ (b"", [(root_key, [present_dir, present_dir])]), ( b"", [ (key_in_1, [absent, present_file]), (file_in_root_key, [present_file, present_file]), (file_rename_t_key, [present_file, absent]), ], ), ] state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._set_data([b"parent-id", b"merged-id"], dirblocks[:]) state._validate() state._discard_merge_parents() state._validate() self.assertEqual(exp_dirblocks, state._dirblocks) def test_discard_all_subdir(self): null_stat = dirstate.DirState.NULLSTAT present_dir = (b"d", b"", 0, False, null_stat) present_file = (b"f", b"", 0, False, null_stat) absent = dirstate.DirState.NULL_PARENT_DETAILS root_key = (b"", b"", b"a-root-value") subdir_key = (b"", b"sub", b"dir-id") child1_key = (b"sub", b"child1", b"child1-id") child2_key = (b"sub", b"child2", b"child2-id") child3_key = (b"sub", b"child3", b"child3-id") dirblocks = [ (b"", [(root_key, [present_dir, present_dir, present_dir])]), (b"", [(subdir_key, [present_dir, present_dir, present_dir])]), ( b"sub", [ (child1_key, [absent, absent, present_file]), (child2_key, [absent, absent, present_file]), (child3_key, [absent, absent, present_file]), ], ), ] exp_dirblocks = [ (b"", [(root_key, [present_dir, present_dir])]), (b"", [(subdir_key, [present_dir, present_dir])]), (b"sub", []), ] state = self.create_empty_dirstate() self.addCleanup(state.unlock) state._set_data([b"parent-id", b"merged-id"], dirblocks[:]) state._validate() state._discard_merge_parents() state._validate() self.assertEqual(exp_dirblocks, state._dirblocks) class Test_InvEntryToDetails(TestCase): def assertDetails(self, expected, inv_entry): details = dirstate._inv_entry_to_details(inv_entry) self.assertEqual(expected, details) # details should always allow join() and always be a 
plain str when # finished (minikind, fingerprint, _size, _executable, tree_data) = details self.assertIsInstance(minikind, bytes) self.assertIsInstance(fingerprint, bytes) self.assertIsInstance(tree_data, bytes) def test_unicode_symlink(self): target = "link-targ\N{EURO SIGN}t" inv_entry = inventory.InventoryLink( b"link-file-id", "nam\N{EURO SIGN}e", b"link-parent-id", b"link-revision-id", symlink_target=target, ) self.assertDetails( (b"l", target.encode("UTF-8"), 0, False, b"link-revision-id"), inv_entry ) class TestSHA1Provider(TestCaseInTempDir): def test_sha1provider_is_an_interface(self): p = dirstate.SHA1Provider() self.assertRaises(NotImplementedError, p.sha1, "foo") self.assertRaises(NotImplementedError, p.stat_and_sha1, "foo") def test_defaultsha1provider_sha1(self): text = b"test\r\nwith\nall\rpossible line endings\r\n" self.build_tree_contents([("foo", text)]) expected_sha = osutils.sha_string(text) p = dirstate.DefaultSHA1Provider() self.assertEqual(expected_sha, p.sha1("foo")) def test_defaultsha1provider_stat_and_sha1(self): text = b"test\r\nwith\nall\rpossible line endings\r\n" self.build_tree_contents([("foo", text)]) expected_sha = osutils.sha_string(text) p = dirstate.DefaultSHA1Provider() statvalue, sha1 = p.stat_and_sha1("foo") self.assertEqual(len(text), statvalue.st_size) self.assertEqual(expected_sha, sha1) class TestBisectDirblock(TestCase): """Test that bisect_dirblock() returns the expected values. bisect_dirblock is intended to work like bisect.bisect_left() except it knows it is working on dirblocks and that dirblocks are sorted by ('path', 'to', 'foo') chunks rather than by raw 'path/to/foo'. """ def assertBisect(self, dirblocks, split_dirblocks, path, *args, **kwargs): """Assert that bisect_split works like bisect_left on the split paths. :param dirblocks: A list of (path, [info]) pairs. :param split_dirblocks: A list of ((split, path), [info]) pairs. :param path: The path we are indexing. All other arguments will be passed along. 
""" self.assertIsInstance(dirblocks, list) bisect_split_idx = dirstate.bisect_dirblock(dirblocks, path, *args, **kwargs) split_dirblock = (path.split(b"/"), []) bisect_left_idx = bisect.bisect_left(split_dirblocks, split_dirblock, *args) self.assertEqual( bisect_left_idx, bisect_split_idx, "bisect_split disagreed. {} != {} for key {!r}".format( bisect_left_idx, bisect_split_idx, path ), ) def paths_to_dirblocks(self, paths): """Convert a list of paths into dirblock form. Also, ensure that the paths are in proper sorted order. """ dirblocks = [(path, []) for path in paths] split_dirblocks = [(path.split(b"/"), []) for path in paths] self.assertEqual(sorted(split_dirblocks), split_dirblocks) return dirblocks, split_dirblocks def test_simple(self): """In the simple case it works just like bisect_left.""" paths = [b"", b"a", b"b", b"c", b"d"] dirblocks, split_dirblocks = self.paths_to_dirblocks(paths) for path in paths: self.assertBisect(dirblocks, split_dirblocks, path) self.assertBisect(dirblocks, split_dirblocks, b"_") self.assertBisect(dirblocks, split_dirblocks, b"aa") self.assertBisect(dirblocks, split_dirblocks, b"bb") self.assertBisect(dirblocks, split_dirblocks, b"cc") self.assertBisect(dirblocks, split_dirblocks, b"dd") self.assertBisect(dirblocks, split_dirblocks, b"a/a") self.assertBisect(dirblocks, split_dirblocks, b"b/b") self.assertBisect(dirblocks, split_dirblocks, b"c/c") self.assertBisect(dirblocks, split_dirblocks, b"d/d") def test_involved(self): """This is where bisect_left diverges slightly.""" paths = [ b"", b"a", b"a/a", b"a/a/a", b"a/a/z", b"a/a-a", b"a/a-z", b"a/z", b"a/z/a", b"a/z/z", b"a/z-a", b"a/z-z", b"a-a", b"a-z", b"z", b"z/a/a", b"z/a/z", b"z/a-a", b"z/a-z", b"z/z", b"z/z/a", b"z/z/z", b"z/z-a", b"z/z-z", b"z-a", b"z-z", ] dirblocks, split_dirblocks = self.paths_to_dirblocks(paths) for path in paths: self.assertBisect(dirblocks, split_dirblocks, path) def test_involved_cached(self): """This is where bisect_left diverges slightly.""" 
paths = [ b"", b"a", b"a/a", b"a/a/a", b"a/a/z", b"a/a-a", b"a/a-z", b"a/z", b"a/z/a", b"a/z/z", b"a/z-a", b"a/z-z", b"a-a", b"a-z", b"z", b"z/a/a", b"z/a/z", b"z/a-a", b"z/a-z", b"z/z", b"z/z/a", b"z/z/z", b"z/z-a", b"z/z-z", b"z-a", b"z-z", ] cache = {} dirblocks, split_dirblocks = self.paths_to_dirblocks(paths) for path in paths: self.assertBisect(dirblocks, split_dirblocks, path, cache=cache) def _unpack_stat(packed_stat): """Turn a packed_stat back into the stat fields. This is meant as a debugging tool and should not be used in real code. """ (st_size, st_mtime, st_ctime, st_dev, st_ino, st_mode) = struct.unpack( ">6L", binascii.a2b_base64(packed_stat) ) return { "st_size": st_size, "st_mtime": st_mtime, "st_ctime": st_ctime, "st_dev": st_dev, "st_ino": st_ino, "st_mode": st_mode, } class TestPackStatRobust(TestCase): """Check packed representation of stat values is robust on all inputs.""" def pack(self, statlike_tuple): return dirstate.pack_stat(os.stat_result(statlike_tuple)) @staticmethod def unpack_field(packed_string, stat_field): return _unpack_stat(packed_string)[stat_field] bzrformats_3.4.0.orig/bzrformats/tests/test_errors.py0000644000000000000000000001033715162115103020135 0ustar00# Copyright (C) 2025 Breezy Contributors # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for bzrformats error classes.""" from .. import errors from . import TestCase class TestNoSuchFile(TestCase): """Test NoSuchFile error.""" def test_no_such_file_str(self): """Test string representation of NoSuchFile.""" err = errors.NoSuchFile("/path/to/missing/file") self.assertEqual("No such file: '/path/to/missing/file'", str(err)) def test_no_such_file_with_extra(self): """Test NoSuchFile with extra information.""" err = errors.NoSuchFile("/path/to/file", "additional info") self.assertEqual("No such file: '/path/to/file': additional info", str(err)) class TestPathError(TestCase): """Test PathError base class.""" def test_path_error_str(self): """Test string representation of PathError.""" err = errors.PathError("/some/path") self.assertEqual("Path error: '/some/path'", str(err)) def test_path_error_with_extra(self): """Test PathError with extra information.""" err = errors.PathError("/some/path", "extra details") self.assertEqual("Path error: '/some/path': extra details", str(err)) class TestReservedId(TestCase): """Test ReservedId error.""" def test_reserved_id_str(self): """Test string representation of ReservedId.""" err = errors.ReservedId(b"null:") self.assertEqual("Reserved revision-id {b'null:'}", str(err)) class TestRevisionNotPresent(TestCase): """Test RevisionNotPresent error.""" def test_revision_not_present_str(self): """Test string representation of RevisionNotPresent.""" err = errors.RevisionNotPresent(b"rev-123", b"file-456") expected = "Revision {b'rev-123'} not present in \"b'file-456'\"." 
self.assertEqual(expected, str(err)) class TestRevisionAlreadyPresent(TestCase): """Test RevisionAlreadyPresent error.""" def test_revision_already_present_str(self): """Test string representation of RevisionAlreadyPresent.""" err = errors.RevisionAlreadyPresent(b"rev-123", b"file-456") expected = "Revision {b'rev-123'} already present in \"b'file-456'\"." self.assertEqual(expected, str(err)) class TestInvalidRevisionId(TestCase): """Test InvalidRevisionId error.""" def test_invalid_revision_id_str(self): """Test string representation of InvalidRevisionId.""" err = errors.InvalidRevisionId(b"bad-rev", "mybranch") expected = "Invalid revision-id {b'bad-rev'} in mybranch" self.assertEqual(expected, str(err)) class TestNoSuchId(TestCase): """Test NoSuchId error.""" def test_no_such_id_str(self): """Test string representation of NoSuchId.""" from bzrformats.inventory import NoSuchId err = NoSuchId("tree-object", b"file-id-123") expected = ( "The file id \"b'file-id-123'\" is not present in the tree tree-object." ) self.assertEqual(expected, str(err)) class TestInconsistentDelta(TestCase): def test_inconsistent_delta_str(self): err = errors.InconsistentDelta("path", "file-id", "reason for foo") self.assertEqual( "An inconsistent delta was supplied involving 'path', 'file-id'\n" "reason: reason for foo", str(err), ) # Add test module discovery def test_suite(): """Return the test suite for error tests.""" import sys import unittest return unittest.TestLoader().loadTestsFromModule(sys.modules[__name__]) if __name__ == "__main__": import unittest unittest.main() bzrformats_3.4.0.orig/bzrformats/tests/test_generate_ids.py0000644000000000000000000001422615162115103021253 0ustar00# Copyright (C) 2006, 2007, 2009, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. 
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

"""Tests for bzrformats/generate_ids.py."""

from .. import generate_ids
from . import TestCase


class TestFileIds(TestCase):
    """Test functions which generate file ids."""

    def assertGenFileId(self, regex, filename):
        """gen_file_id should create a file id matching the regex.

        The file id should be ascii, and should be an 8-bit string
        """
        file_id = generate_ids.gen_file_id(filename)
        self.assertContainsRe(file_id, b"^" + regex + b"$")
        # It should be a utf8 file_id, not a unicode one
        self.assertIsInstance(file_id, bytes)
        # gen_file_id should always return ascii file ids.
        file_id.decode("ascii")

    def test_gen_file_id(self):
        gen_file_id = generate_ids.gen_file_id

        # We try to use the filename if possible
        self.assertStartsWith(gen_file_id("bar"), b"bar-")

        # but we squash capitalization, and remove non word characters
        self.assertStartsWith(gen_file_id("Mwoo oof\t m"), b"mwoooofm-")

        # We also remove leading '.' characters to prevent hidden file-ids
        self.assertStartsWith(gen_file_id("..gam.py"), b"gam.py-")
        self.assertStartsWith(gen_file_id("..Mwoo oof\t m"), b"mwoooofm-")

        # we remove unicode characters, and still don't end up with a
        # hidden file id
        self.assertStartsWith(gen_file_id("\xe5\xb5.txt"), b"txt-")

        # Our current method of generating unique ids adds 33 characters
        # plus a serial number (log10(N) characters)
        # to the end of the filename.
We now restrict the filename portion to # be <= 20 characters, so the maximum length should now be approx < 60 # Test both case squashing and length restriction fid = gen_file_id("A" * 50 + ".txt") self.assertStartsWith(fid, b"a" * 20 + b"-") self.assertLess(len(fid), 60) # restricting length happens after the other actions, so # we preserve as much as possible fid = gen_file_id("\xe5\xb5..aBcd\tefGhijKLMnop\tqrstuvwxyz") self.assertStartsWith(fid, b"abcdefghijklmnopqrst-") self.assertLess(len(fid), 60) def test_file_ids_are_ascii(self): tail = rb"-\d{14}-[a-z0-9]{16}-\d+" self.assertGenFileId(b"foo" + tail, "foo") self.assertGenFileId(b"foo" + tail, "foo") self.assertGenFileId(b"bar" + tail, "bar") self.assertGenFileId(b"br" + tail, "b\xe5r") def test__next_id_suffix_increments(self): ids = [generate_ids._next_id_suffix(suffix="foo-") for i in range(10)] ns = [int(id.split(b"-")[-1]) for id in ids] for i in range(1, len(ns)): self.assertEqual(ns[i] - 1, ns[i - 1]) def test_gen_root_id(self): # Mostly just make sure gen_root_id() exists root_id = generate_ids.gen_root_id() self.assertStartsWith(root_id, b"tree_root-") class TestGenRevisionId(TestCase): """Test generating revision ids.""" def assertGenRevisionId(self, regex, username, timestamp=None): """gen_revision_id should create a revision id matching the regex.""" revision_id = generate_ids.gen_revision_id(username, timestamp) self.assertContainsRe(revision_id, b"^" + regex + b"$") # It should be a utf8 revision_id, not a unicode one self.assertIsInstance(revision_id, bytes) # gen_revision_id should always return ascii revision ids. 
revision_id.decode("ascii") def test_timestamp(self): """Passing a timestamp should cause it to be used.""" self.assertGenRevisionId(rb"user@host-\d{14}-[a-z0-9]{16}", "user@host") self.assertGenRevisionId( b"user@host-20061102205056-[a-z0-9]{16}", "user@host", 1162500656.688 ) self.assertGenRevisionId( rb"user@host-20061102205024-[a-z0-9]{16}", "user@host", 1162500624.000 ) def test_gen_revision_id_email(self): """gen_revision_id uses email address if present.""" regex = rb"user\+joe_bar@foo-bar\.com-\d{14}-[a-z0-9]{16}" self.assertGenRevisionId(regex, "user+joe_bar@foo-bar.com") self.assertGenRevisionId(regex, "") self.assertGenRevisionId(regex, "Joe Bar ") self.assertGenRevisionId(regex, "Joe Bar ") self.assertGenRevisionId(regex, "Joe B\xe5r ") def test_gen_revision_id_user(self): """If there is no email, fall back to the whole username.""" tail = rb"-\d{14}-[a-z0-9]{16}" self.assertGenRevisionId(b"joe_bar" + tail, "Joe Bar") self.assertGenRevisionId(b"joebar" + tail, "joebar") self.assertGenRevisionId(b"joe_br" + tail, "Joe B\xe5r") self.assertGenRevisionId( rb"joe_br_user\+joe_bar_foo-bar.com" + tail, "Joe B\xe5r ", ) def test_revision_ids_are_ascii(self): """gen_revision_id should always return an ascii revision id.""" tail = rb"-\d{14}-[a-z0-9]{16}" self.assertGenRevisionId(b"joe_bar" + tail, "Joe Bar") self.assertGenRevisionId(b"joe_bar" + tail, "Joe Bar") self.assertGenRevisionId(b"joe@foo" + tail, "Joe Bar ") # We cheat a little with this one, because email-addresses shouldn't # contain non-ascii characters, but generate_ids should strip them # anyway. 
self.assertGenRevisionId(b"joe@f" + tail, "Joe Bar ") bzrformats_3.4.0.orig/bzrformats/tests/test_groupcompress.py0000644000000000000000000015447715162115103021547 0ustar00# Copyright (C) 2008-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for group compression.""" import logging import zlib from testscenarios import load_tests_apply_scenarios from .. import btree_index, groupcompress, knit, osutils, versionedfile from .. import index as _mod_index from ..osutils import sha_string from . import TestCase, TestCaseWithMemoryTransport, TestNotApplicable from .test__groupcompress import _compiled_groupcompress_module def group_compress_implementation_scenarios(): scenarios = [ ("python", {"compressor": groupcompress.PythonGroupCompressor}), ] if _compiled_groupcompress_module is not None: scenarios.append(("C", {"compressor": groupcompress.PyrexGroupCompressor})) return scenarios load_tests = load_tests_apply_scenarios class TestGroupCompressor(TestCase): def _chunks_to_repr_lines(self, chunks): return "\n".join(map(repr, b"".join(chunks).split(b"\n"))) def assertEqualDiffEncoded(self, expected, actual): """Compare the actual content to the expected content. 
        :param expected: A group of chunks that we expect to see
        :param actual: The measured 'chunks'

        We will transform the chunks back into lines, and then run 'repr()'
        over them to handle non-ascii characters.
        """
        self.assertEqualDiff(
            self._chunks_to_repr_lines(expected), self._chunks_to_repr_lines(actual)
        )


class TestAllGroupCompressors(TestGroupCompressor):
    """Tests for GroupCompressor."""

    scenarios = group_compress_implementation_scenarios()
    compressor = None  # Set by scenario

    def test_empty_delta(self):
        compressor = self.compressor()
        self.assertEqual([], compressor.chunks)

    def test_one_nosha_delta(self):
        # diff against NULL
        compressor = self.compressor()
        text = b"strange\ncommon\n"
        sha1, start_point, end_point, _ = compressor.compress(
            (b"label",), [text], len(text), None
        )
        self.assertEqual(sha_string(b"strange\ncommon\n"), sha1)
        expected_lines = b"f\x0fstrange\ncommon\n"
        self.assertEqual(expected_lines, b"".join(compressor.chunks))
        self.assertEqual(0, start_point)
        self.assertEqual(len(expected_lines), end_point)

    def test_empty_content(self):
        compressor = self.compressor()
        # Adding empty bytes should return the 'null' record
        sha1, start_point, end_point, kind = compressor.compress(
            (b"empty",), [], 0, None
        )
        self.assertEqual(0, start_point)
        self.assertEqual(0, end_point)
        self.assertEqual("fulltext", kind)
        self.assertEqual(groupcompress._null_sha1, sha1)
        self.assertEqual(0, compressor.endpoint)
        self.assertEqual([], compressor.chunks)
        # Even after adding some content
        text = b"some\nbytes\n"
        compressor.compress((b"content",), [text], len(text), None)
        self.assertGreater(compressor.endpoint, 0)
        sha1, start_point, end_point, kind = compressor.compress(
            (b"empty2",), [], 0, None
        )
        self.assertEqual(0, start_point)
        self.assertEqual(0, end_point)
        self.assertEqual("fulltext", kind)
        self.assertEqual(groupcompress._null_sha1, sha1)

    def test_extract_from_compressor(self):
        # Knit fetching will try to reconstruct texts locally which results in
        # reading something that is in the
compressor stream already. compressor = self.compressor() text = b"strange\ncommon long line\nthat needs a 16 byte match\n" sha1_1, _, _, _ = compressor.compress((b"label",), [text], len(text), None) list(compressor.chunks) text = b"common long line\nthat needs a 16 byte match\ndifferent\n" sha1_2, _, _end_point, _ = compressor.compress( (b"newlabel",), [text], len(text), None ) # get the first out self.assertEqual( ([b"strange\ncommon long line\nthat needs a 16 byte match\n"], sha1_1), compressor.extract((b"label",)), ) # and the second self.assertEqual( ( [b"common long line\nthat needs a 16 byte match\ndifferent\n"], sha1_2, ), compressor.extract((b"newlabel",)), ) class TestPyrexGroupCompressor(TestGroupCompressor): compressor = groupcompress.PyrexGroupCompressor def setUp(self): super().setUp() if _compiled_groupcompress_module is None: self.skipTest("bzrformats._groupcompress_pyx not available") def test_stats(self): compressor = self.compressor() chunks = [b"strange\n", b"common very very long line\n", b"plus more text\n"] compressor.compress((b"label",), chunks, sum(map(len, chunks)), None) chunks = [ b"common very very long line\n", b"plus more text\n", b"different\n", b"moredifferent\n", ] compressor.compress((b"newlabel",), chunks, sum(map(len, chunks)), None) chunks = [ b"new\n", b"common very very long line\n", b"plus more text\n", b"different\n", b"moredifferent\n", ] compressor.compress((b"label3",), chunks, sum(map(len, chunks)), None) self.assertAlmostEqual(1.9, compressor.ratio(), 1) def test_two_nosha_delta(self): compressor = self.compressor() text = b"strange\ncommon long line\nthat needs a 16 byte match\n" _sha1_1, _, _, _ = compressor.compress((b"label",), [text], len(text), None) expected_lines = list(compressor.chunks) text = b"common long line\nthat needs a 16 byte match\ndifferent\n" sha1_2, _start_point, end_point, _ = compressor.compress( (b"newlabel",), [text], len(text), None ) self.assertEqual(sha_string(text), sha1_2) 
expected_lines.extend( [ # 'delta', delta length b"d\x0f", # source and target length b"\x36", # copy the line common b"\x91\x0a\x2c", # copy, offset 0x0a, len 0x2c # add the line different, and the trailing newline b"\x0adifferent\n", # insert 10 bytes ] ) self.assertEqualDiffEncoded(expected_lines, compressor.chunks) self.assertEqual(sum(map(len, expected_lines)), end_point) def test_three_nosha_delta(self): # The first interesting test: make a change that should use lines from # both parents. compressor = self.compressor() text = b"strange\ncommon very very long line\nwith some extra text\n" _sha1_1, _, _, _ = compressor.compress((b"label",), [text], len(text), None) text = b"different\nmoredifferent\nand then some more\n" _sha1_2, _, _, _ = compressor.compress((b"newlabel",), [text], len(text), None) expected_lines = list(compressor.chunks) text = ( b"new\ncommon very very long line\nwith some extra text\n" b"different\nmoredifferent\nand then some more\n" ) sha1_3, _start_point, end_point, _ = compressor.compress( (b"label3",), [text], len(text), None ) self.assertEqual(sha_string(text), sha1_3) expected_lines.extend( [ # 'delta', delta length b"d\x0b", # source and target length b"\x5f" # insert new b"\x03new", # Copy of first parent 'common' range b"\x91\x09\x31" # copy, offset 0x09, 0x31 bytes # Copy of second parent 'different' range b"\x91\x3c\x2b", # copy, offset 0x3c, 0x2b bytes ] ) self.assertEqualDiffEncoded(expected_lines, compressor.chunks) self.assertEqual(sum(map(len, expected_lines)), end_point) class TestPythonGroupCompressor(TestGroupCompressor): compressor = groupcompress.PythonGroupCompressor def test_stats(self): compressor = self.compressor() chunks = [b"strange\n", b"common very very long line\n", b"plus more text\n"] compressor.compress((b"label",), chunks, sum(map(len, chunks)), None) chunks = [ b"common very very long line\n", b"plus more text\n", b"different\n", b"moredifferent\n", ] compressor.compress((b"newlabel",), chunks, 
sum(map(len, chunks)), None) chunks = [ b"new\n", b"common very very long line\n", b"plus more text\n", b"different\n", b"moredifferent\n", ] compressor.compress((b"label3",), chunks, sum(map(len, chunks)), None) self.assertAlmostEqual(1.9, compressor.ratio(), 1) def test_two_nosha_delta(self): compressor = self.compressor() text = b"strange\ncommon long line\nthat needs a 16 byte match\n" _sha1_1, _, _, _ = compressor.compress((b"label",), [text], len(text), None) expected_lines = list(compressor.chunks) text = b"common long line\nthat needs a 16 byte match\ndifferent\n" sha1_2, _start_point, end_point, _ = compressor.compress( (b"newlabel",), [text], len(text), None ) self.assertEqual(sha_string(text), sha1_2) expected_lines.extend( [ # 'delta', delta length b"d\x0f", # target length b"\x36", # copy the line common b"\x91\x0a\x2c", # copy, offset 0x0a, len 0x2c # add the line different, and the trailing newline b"\x0adifferent\n", # insert 10 bytes ] ) self.assertEqualDiffEncoded(expected_lines, compressor.chunks) self.assertEqual(sum(map(len, expected_lines)), end_point) def test_three_nosha_delta(self): # The first interesting test: make a change that should use lines from # both parents. 
compressor = self.compressor() text = b"strange\ncommon very very long line\nwith some extra text\n" _sha1_1, _, _, _ = compressor.compress((b"label",), [text], len(text), None) text = b"different\nmoredifferent\nand then some more\n" _sha1_2, _, _, _ = compressor.compress((b"newlabel",), [text], len(text), None) expected_lines = list(compressor.chunks) text = ( b"new\ncommon very very long line\nwith some extra text\n" b"different\nmoredifferent\nand then some more\n" ) sha1_3, _start_point, end_point, _ = compressor.compress( (b"label3",), [text], len(text), None ) self.assertEqual(sha_string(text), sha1_3) expected_lines.extend( [ # 'delta', delta length b"d\x0c", # target length b"\x5f" # insert new b"\x04new\n", # Copy of first parent 'common' range b"\x91\x0a\x30" # copy, offset 0x0a, 0x30 bytes # Copy of second parent 'different' range b"\x91\x3c\x2b", # copy, offset 0x3c, 0x2b bytes ] ) self.assertEqualDiffEncoded(expected_lines, compressor.chunks) self.assertEqual(sum(map(len, expected_lines)), end_point) class TestGroupCompressBlock(TestCase): def make_block(self, key_to_text): """Create a GroupCompressBlock, filling it with the given texts.""" compressor = groupcompress.GroupCompressor() for key in sorted(key_to_text): compressor.compress(key, [key_to_text[key]], len(key_to_text[key]), None) locs = { key: (start, end) for key, (start, _, end, _) in compressor.labels_deltas.items() } block = compressor.flush() raw_bytes = block.to_bytes() # Go through from_bytes(to_bytes()) so that we start with a compressed # content object return locs, groupcompress.GroupCompressBlock.from_bytes(raw_bytes) def test_from_empty_bytes(self): self.assertRaises(ValueError, groupcompress.GroupCompressBlock.from_bytes, b"") def test_from_minimal_bytes(self): block = groupcompress.GroupCompressBlock.from_bytes(b"gcb1z\n0\n0\n") self.assertIsInstance(block, groupcompress.GroupCompressBlock) self.assertIs(None, block._content) self.assertEqual(b"", block._z_content) 
        block._ensure_content()
        self.assertEqual(b"", block._content)
        self.assertEqual(b"", block._z_content)
        block._ensure_content()  # Ensure content is safe to call 2x

    def test_from_invalid(self):
        self.assertRaises(
            ValueError,
            groupcompress.GroupCompressBlock.from_bytes,
            b"this is not a valid header",
        )

    def test_from_bytes(self):
        content = b"a tiny bit of content\n"
        z_content = zlib.compress(content)
        z_bytes = (
            b"gcb1z\n"  # group compress block v1 zlib
            b"%d\n"  # Length of compressed content
            b"%d\n"  # Length of uncompressed content
            b"%s"  # Compressed content
        ) % (len(z_content), len(content), z_content)
        block = groupcompress.GroupCompressBlock.from_bytes(z_bytes)
        self.assertEqual(z_content, block._z_content)
        self.assertIs(None, block._content)
        self.assertEqual(len(z_content), block._z_content_length)
        self.assertEqual(len(content), block._content_length)
        block._ensure_content()
        self.assertEqual(z_content, block._z_content)
        self.assertEqual(content, block._content)

    def test_to_chunks(self):
        content_chunks = [
            b"this is some content\n",
            b"this content will be compressed\n",
        ]
        content_len = sum(map(len, content_chunks))
        content = b"".join(content_chunks)
        gcb = groupcompress.GroupCompressBlock()
        gcb.set_chunked_content(content_chunks, content_len)
        total_len, block_chunks = gcb.to_chunks()
        block_bytes = b"".join(block_chunks)
        self.assertEqual(gcb._z_content_length, len(gcb._z_content))
        self.assertEqual(total_len, len(block_bytes))
        self.assertEqual(gcb._content_length, content_len)
        expected_header = (
            b"gcb1z\n"  # group compress block v1 zlib
            b"%d\n"  # Length of compressed content
            b"%d\n"  # Length of uncompressed content
        ) % (gcb._z_content_length, gcb._content_length)
        # The first chunk should be the header chunk.
It is small, fixed size, # and there is no compelling reason to split it up self.assertEqual(expected_header, block_chunks[0]) self.assertStartsWith(block_bytes, expected_header) remaining_bytes = block_bytes[len(expected_header) :] raw_bytes = zlib.decompress(remaining_bytes) self.assertEqual(content, raw_bytes) def test_to_bytes(self): content = b"this is some content\nthis content will be compressed\n" gcb = groupcompress.GroupCompressBlock() gcb.set_content(content) data = gcb.to_bytes() self.assertEqual(gcb._z_content_length, len(gcb._z_content)) self.assertEqual(gcb._content_length, len(content)) expected_header = ( b"gcb1z\n" # group compress block v1 zlib b"%d\n" # Length of compressed content b"%d\n" # Length of uncompressed content ) % (gcb._z_content_length, gcb._content_length) self.assertStartsWith(data, expected_header) remaining_bytes = data[len(expected_header) :] raw_bytes = zlib.decompress(remaining_bytes) self.assertEqual(content, raw_bytes) # we should get the same results if using the chunked version gcb = groupcompress.GroupCompressBlock() gcb.set_chunked_content( [b"this is some content\nthis content will be compressed\n"], len(content), ) old_data = data data = gcb.to_bytes() self.assertEqual(old_data, data) def test_partial_decomp(self): content_chunks = [] # We need a sufficient amount of data so that zlib.decompress has # partial decompression to work with. Most auto-generated data # compresses a bit too well, we want a combination, so we combine a sha # hash with compressible data. 
        for i in range(2048):
            next_content = b"%d\nThis is a bit of duplicate text\n" % (i,)
            content_chunks.append(next_content)
            next_sha1 = osutils.sha_string(next_content)
            content_chunks.append(next_sha1 + b"\n")
        content = b"".join(content_chunks)
        self.assertEqual(158634, len(content))
        z_content = zlib.compress(content)
        self.assertEqual(57182, len(z_content))
        block = groupcompress.GroupCompressBlock()
        block._z_content_chunks = (z_content,)
        block._z_content_length = len(z_content)
        block._compressor_name = "zlib"
        block._content_length = 158634
        self.assertIs(None, block._content)
        block._ensure_content(100)
        self.assertIsNot(None, block._content)
        # We have decompressed at least 100 bytes
        self.assertGreaterEqual(len(block._content), 100)
        # We have not decompressed the whole content
        self.assertLess(len(block._content), 158634)
        self.assertEqualDiff(content[: len(block._content)], block._content)
        # Ensuring content that we already have shouldn't cause any more data
        # to be extracted
        cur_len = len(block._content)
        block._ensure_content(cur_len - 10)
        self.assertEqual(cur_len, len(block._content))
        # Now we want a bit more content
        cur_len += 10
        block._ensure_content(cur_len)
        self.assertGreaterEqual(len(block._content), cur_len)
        self.assertLess(len(block._content), 158634)
        self.assertEqualDiff(content[: len(block._content)], block._content)
        # And now let's finish
        block._ensure_content(158634)
        self.assertEqualDiff(content, block._content)
        # And the decompressor is finalized
        self.assertIs(None, block._z_content_decompressor)

    def test__ensure_all_content(self):
        content_chunks = []
        # We need a sufficient amount of data so that zlib.decompress has
        # partial decompression to work with. Most auto-generated data
        # compresses a bit too well, we want a combination, so we combine a sha
        # hash with compressible data.
for i in range(2048): next_content = b"%d\nThis is a bit of duplicate text\n" % (i,) content_chunks.append(next_content) next_sha1 = osutils.sha_string(next_content) content_chunks.append(next_sha1 + b"\n") content = b"".join(content_chunks) self.assertEqual(158634, len(content)) z_content = zlib.compress(content) self.assertEqual(57182, len(z_content)) block = groupcompress.GroupCompressBlock() block._z_content_chunks = (z_content,) block._z_content_length = len(z_content) block._compressor_name = "zlib" block._content_length = 158634 self.assertIs(None, block._content) # The first _ensure_content got all of the required data block._ensure_content(158634) self.assertEqualDiff(content, block._content) # And we should have released the _z_content_decompressor since it was # fully consumed self.assertIs(None, block._z_content_decompressor) def test__dump(self): dup_content = b"some duplicate content\nwhich is sufficiently long\n" key_to_text = { (b"1",): dup_content + b"1 unique\n", (b"2",): dup_content + b"2 extra special\n", } _locs, block = self.make_block(key_to_text) self.assertEqual( [ (b"f", len(key_to_text[(b"1",)])), ( b"d", 21, len(key_to_text[(b"2",)]), [ (b"c", 2, len(dup_content)), (b"i", len(b"2 extra special\n"), b""), ], ), ], block._dump(), ) class TestCaseWithGroupCompressVersionedFiles(TestCaseWithMemoryTransport): def make_test_vf( self, create_graph, keylength=1, do_cleanup=True, dir=".", inconsistency_fatal=True, ): t = self.get_transport(dir) t.ensure_base() vf = groupcompress.make_pack_factory( graph=create_graph, delta=False, keylength=keylength, inconsistency_fatal=inconsistency_fatal, )(t) if do_cleanup: self.addCleanup(groupcompress.cleanup_pack_group, vf) return vf class TestGroupCompressVersionedFiles(TestCaseWithGroupCompressVersionedFiles): def make_g_index(self, name, ref_lists=0, nodes=None): if nodes is None: nodes = [] builder = btree_index.BTreeBuilder(ref_lists) for node, references, value in nodes: builder.add_node(node, 
references, value) stream = builder.finish() trans = self.get_transport() size = trans.put_file(name, stream) return btree_index.BTreeGraphIndex(trans, name, size) def make_g_index_missing_parent(self): graph_index = self.make_g_index( "missing_parent", 1, [ ((b"parent",), b"2 78 2 10", ([],)), ((b"tip",), b"2 78 2 10", ([(b"parent",), (b"missing-parent",)],)), ], ) return graph_index def test_get_record_stream_as_requested(self): # Consider promoting 'as-requested' to general availability, and # make this a VF interface test vf = self.make_test_vf(False, dir="source") vf.add_lines((b"a",), (), [b"lines\n"]) vf.add_lines((b"b",), (), [b"lines\n"]) vf.add_lines((b"c",), (), [b"lines\n"]) vf.add_lines((b"d",), (), [b"lines\n"]) vf.writer.end() keys = [ record.key for record in vf.get_record_stream( [(b"a",), (b"b",), (b"c",), (b"d",)], "as-requested", False ) ] self.assertEqual([(b"a",), (b"b",), (b"c",), (b"d",)], keys) keys = [ record.key for record in vf.get_record_stream( [(b"b",), (b"a",), (b"d",), (b"c",)], "as-requested", False ) ] self.assertEqual([(b"b",), (b"a",), (b"d",), (b"c",)], keys) # It should work even after being repacked into another VF vf2 = self.make_test_vf(False, dir="target") vf2.insert_record_stream( vf.get_record_stream( [(b"b",), (b"a",), (b"d",), (b"c",)], "as-requested", False ) ) vf2.writer.end() keys = [ record.key for record in vf2.get_record_stream( [(b"a",), (b"b",), (b"c",), (b"d",)], "as-requested", False ) ] self.assertEqual([(b"a",), (b"b",), (b"c",), (b"d",)], keys) keys = [ record.key for record in vf2.get_record_stream( [(b"b",), (b"a",), (b"d",), (b"c",)], "as-requested", False ) ] self.assertEqual([(b"b",), (b"a",), (b"d",), (b"c",)], keys) def test_get_record_stream_max_bytes_to_index_default(self): vf = self.make_test_vf(True, dir="source") vf.add_lines((b"a",), (), [b"lines\n"]) vf.writer.end() record = next(vf.get_record_stream([(b"a",)], "unordered", True)) self.assertEqual( vf._DEFAULT_COMPRESSOR_SETTINGS, 
record._manager._get_compressor_settings() ) def test_get_record_stream_accesses_compressor_settings(self): vf = self.make_test_vf(True, dir="source") vf.add_lines((b"a",), (), [b"lines\n"]) vf.writer.end() vf._max_bytes_to_index = 1234 record = next(vf.get_record_stream([(b"a",)], "unordered", True)) self.assertEqual( {"max_bytes_to_index": 1234}, record._manager._get_compressor_settings() ) @staticmethod def grouped_stream(revision_ids, first_parents=()): parents = first_parents for revision_id in revision_ids: key = (revision_id,) record = versionedfile.FulltextContentFactory( key, parents, None, b"some content that is\n" b"identical except for\n" b"revision_id:%s\n" % (revision_id,), ) yield record parents = (key,) def test_insert_record_stream_reuses_blocks(self): vf = self.make_test_vf(True, dir="source") # One group, a-d vf.insert_record_stream(self.grouped_stream([b"a", b"b", b"c", b"d"])) # Second group, e-h vf.insert_record_stream( self.grouped_stream([b"e", b"f", b"g", b"h"], first_parents=((b"d",),)) ) block_bytes = {} stream = vf.get_record_stream( [(r.encode(),) for r in "abcdefgh"], "unordered", False ) num_records = 0 for record in stream: if record.key in [(b"a",), (b"e",)]: self.assertEqual("groupcompress-block", record.storage_kind) else: self.assertEqual("groupcompress-block-ref", record.storage_kind) block_bytes[record.key] = record._manager._block._z_content num_records += 1 self.assertEqual(8, num_records) for r in "abcd": key = (r.encode(),) self.assertIs(block_bytes[key], block_bytes[(b"a",)]) self.assertNotEqual(block_bytes[key], block_bytes[(b"e",)]) for r in "efgh": key = (r.encode(),) self.assertIs(block_bytes[key], block_bytes[(b"e",)]) self.assertNotEqual(block_bytes[key], block_bytes[(b"a",)]) # Now copy the blocks into another vf, and ensure that the blocks are # preserved without creating new entries vf2 = self.make_test_vf(True, dir="target") keys = [(r.encode(),) for r in "abcdefgh"] # ordering in 'groupcompress' order, should 
actually swap the groups in # the target vf, but the groups themselves should not be disturbed. def small_size_stream(): for record in vf.get_record_stream(keys, "groupcompress", False): record._manager._full_enough_block_size = ( record._manager._block._content_length ) yield record vf2.insert_record_stream(small_size_stream()) stream = vf2.get_record_stream(keys, "groupcompress", False) vf2.writer.end() num_records = 0 for record in stream: num_records += 1 self.assertEqual(block_bytes[record.key], record._manager._block._z_content) self.assertEqual(8, num_records) def test_insert_record_stream_packs_on_the_fly(self): vf = self.make_test_vf(True, dir="source") # One group, a-d vf.insert_record_stream(self.grouped_stream([b"a", b"b", b"c", b"d"])) # Second group, e-h vf.insert_record_stream( self.grouped_stream([b"e", b"f", b"g", b"h"], first_parents=((b"d",),)) ) # Now copy the blocks into another vf, and see that the # insert_record_stream rebuilt a new block on-the-fly because of # under-utilization vf2 = self.make_test_vf(True, dir="target") keys = [(r.encode(),) for r in "abcdefgh"] vf2.insert_record_stream(vf.get_record_stream(keys, "groupcompress", False)) stream = vf2.get_record_stream(keys, "groupcompress", False) vf2.writer.end() num_records = 0 # All of the records should be recombined into a single block block = None for record in stream: num_records += 1 if block is None: block = record._manager._block else: self.assertIs(block, record._manager._block) self.assertEqual(8, num_records) def test__insert_record_stream_no_reuse_block(self): vf = self.make_test_vf(True, dir="source") # One group, a-d vf.insert_record_stream(self.grouped_stream([b"a", b"b", b"c", b"d"])) # Second group, e-h vf.insert_record_stream( self.grouped_stream([b"e", b"f", b"g", b"h"], first_parents=((b"d",),)) ) vf.writer.end() keys = [(r.encode(),) for r in "abcdefgh"] self.assertEqual(8, len(list(vf.get_record_stream(keys, "unordered", False)))) # Now copy the blocks into another 
vf, and ensure that the blocks are # preserved without creating new entries vf2 = self.make_test_vf(True, dir="target") # ordering in 'groupcompress' order, should actually swap the groups in # the target vf, but the groups themselves should not be disturbed. list( vf2._insert_record_stream( vf.get_record_stream(keys, "groupcompress", False), reuse_blocks=False ) ) vf2.writer.end() # After inserting with reuse_blocks=False, we should have everything in # a single new block. stream = vf2.get_record_stream(keys, "groupcompress", False) block = None for record in stream: if block is None: block = record._manager._block else: self.assertIs(block, record._manager._block) def test_add_missing_noncompression_parent_unvalidated_index(self): unvalidated = self.make_g_index_missing_parent() combined = _mod_index.CombinedGraphIndex([unvalidated]) index = groupcompress._GCGraphIndex( combined, is_locked=lambda: True, parents=True, track_external_parent_refs=True, ) index.scan_unvalidated_index(unvalidated) self.assertEqual(frozenset([(b"missing-parent",)]), index.get_missing_parents()) def test_track_external_parent_refs(self): g_index = self.make_g_index("empty", 1, []) mod_index = btree_index.BTreeBuilder(1, 1) combined = _mod_index.CombinedGraphIndex([g_index, mod_index]) index = groupcompress._GCGraphIndex( combined, is_locked=lambda: True, parents=True, add_callback=mod_index.add_nodes, track_external_parent_refs=True, ) index.add_records( [((b"new-key",), b"2 10 2 10", [((b"parent-1",), (b"parent-2",))])] ) self.assertEqual( frozenset([(b"parent-1",), (b"parent-2",)]), index.get_missing_parents() ) def make_source_with_b(self, a_parent, path): source = self.make_test_vf(True, dir=path) source.add_lines((b"a",), (), [b"lines\n"]) b_parents = ((b"a",),) if a_parent else () source.add_lines((b"b",), b_parents, [b"lines\n"]) return source def do_inconsistent_inserts(self, inconsistency_fatal): target = self.make_test_vf( True, dir="target", 
inconsistency_fatal=inconsistency_fatal ) for x in range(2): source = self.make_source_with_b(x == 1, f"source{x}") target.insert_record_stream( source.get_record_stream([(b"b",)], "unordered", False) ) def test_inconsistent_redundant_inserts_warn(self): """Should not insert a record that is already present.""" # Capture logging warnings import io log_stream = io.StringIO() handler = logging.StreamHandler(log_stream) handler.setLevel(logging.WARNING) # Get the groupcompress logger gc_logger = logging.getLogger("bzrformats.groupcompress") gc_logger.addHandler(handler) old_level = gc_logger.level gc_logger.setLevel(logging.WARNING) try: self.do_inconsistent_inserts(inconsistency_fatal=False) finally: gc_logger.removeHandler(handler) gc_logger.setLevel(old_level) warnings = log_stream.getvalue() self.assertContainsRe( warnings, r"inconsistent details in skipped record: \(b?'b',\)" r" \(b?'42 32 0 8', \(\(\),\)\)" r" \(b?'74 32 0 8', \(\(\(b?'a',\),\),\)\)$", ) def test_inconsistent_redundant_inserts_raises(self): e = self.assertRaises( knit.KnitCorrupt, self.do_inconsistent_inserts, inconsistency_fatal=True ) self.assertContainsRe( str(e), r"Knit.* corrupt: inconsistent details" r" in add_records:" r" \(b?'b',\) \(b?'42 32 0 8', \(\(\),\)\)" r" \(b?'74 32 0 8', \(\(\(b?'a',\),\),\)\)", ) def test_clear_cache(self): vf = self.make_source_with_b(True, "source") vf.writer.end() for _record in vf.get_record_stream([(b"a",), (b"b",)], "unordered", True): pass self.assertGreater(len(vf._group_cache), 0) vf.clear_cache() self.assertEqual(0, len(vf._group_cache)) class TestGroupCompressConfig(TestCaseWithMemoryTransport): def make_test_vf(self): t = self.get_transport(".") t.ensure_base() factory = groupcompress.make_pack_factory( graph=True, delta=False, keylength=1, inconsistency_fatal=True ) vf = factory(t) self.addCleanup(groupcompress.cleanup_pack_group, vf) return vf def test_max_bytes_to_index_default(self): vf = self.make_test_vf() gc = vf._make_group_compressor() 
self.assertEqual(vf._DEFAULT_MAX_BYTES_TO_INDEX, vf._max_bytes_to_index) if isinstance(gc, groupcompress.PyrexGroupCompressor): self.assertEqual( vf._DEFAULT_MAX_BYTES_TO_INDEX, gc._delta_index._max_bytes_to_index ) def test_max_bytes_to_index_set_directly(self): vf = self.make_test_vf() vf._max_bytes_to_index = 10000 gc = vf._make_group_compressor() self.assertEqual(10000, vf._max_bytes_to_index) if isinstance(gc, groupcompress.PyrexGroupCompressor): self.assertEqual(10000, gc._delta_index._max_bytes_to_index) class StubGCVF: def __init__(self, canned_get_blocks=None): self._group_cache = {} self._canned_get_blocks = canned_get_blocks or [] def _get_blocks(self, read_memos): return iter(self._canned_get_blocks) class Test_BatchingBlockFetcher(TestCaseWithGroupCompressVersionedFiles): """Simple whitebox unit tests for _BatchingBlockFetcher.""" def test_add_key_new_read_memo(self): """Adding a key with an uncached read_memo new to this batch adds that read_memo to the list of memos to fetch. """ # locations are: index_memo, ignored, parents, ignored # where index_memo is: (idx, offset, len, factory_start, factory_end) # and (idx, offset, size) is known as the 'read_memo', identifying the # raw bytes needed. read_memo = ("fake index", 100, 50) locations = {("key",): (read_memo + (None, None), None, None, None)} batcher = groupcompress._BatchingBlockFetcher(StubGCVF(), locations) total_size = batcher.add_key(("key",)) self.assertEqual(50, total_size) self.assertEqual([("key",)], batcher.keys) self.assertEqual([read_memo], batcher.memos_to_get) def test_add_key_duplicate_read_memo(self): """read_memos that occur multiple times in a batch will only be fetched once. """ read_memo = ("fake index", 100, 50) # Two keys, both sharing the same read memo (but different overall # index_memos). 
locations = { ("key1",): (read_memo + (0, 1), None, None, None), ("key2",): (read_memo + (1, 2), None, None, None), } batcher = groupcompress._BatchingBlockFetcher(StubGCVF(), locations) total_size = batcher.add_key(("key1",)) total_size = batcher.add_key(("key2",)) self.assertEqual(50, total_size) self.assertEqual([("key1",), ("key2",)], batcher.keys) self.assertEqual([read_memo], batcher.memos_to_get) def test_add_key_cached_read_memo(self): """Adding a key with a cached read_memo will not cause that read_memo to be added to the list to fetch. """ read_memo = ("fake index", 100, 50) gcvf = StubGCVF() gcvf._group_cache[read_memo] = "fake block" locations = {("key",): (read_memo + (None, None), None, None, None)} batcher = groupcompress._BatchingBlockFetcher(gcvf, locations) total_size = batcher.add_key(("key",)) self.assertEqual(0, total_size) self.assertEqual([("key",)], batcher.keys) self.assertEqual([], batcher.memos_to_get) def test_yield_factories_empty(self): """An empty batch yields no factories.""" batcher = groupcompress._BatchingBlockFetcher(StubGCVF(), {}) self.assertEqual([], list(batcher.yield_factories())) def test_yield_factories_calls_get_blocks(self): """Uncached memos are retrieved via get_blocks.""" read_memo1 = ("fake index", 100, 50) read_memo2 = ("fake index", 150, 40) gcvf = StubGCVF( canned_get_blocks=[ (read_memo1, groupcompress.GroupCompressBlock()), (read_memo2, groupcompress.GroupCompressBlock()), ] ) locations = { ("key1",): (read_memo1 + (0, 0), None, None, None), ("key2",): (read_memo2 + (0, 0), None, None, None), } batcher = groupcompress._BatchingBlockFetcher(gcvf, locations) batcher.add_key(("key1",)) batcher.add_key(("key2",)) factories = list(batcher.yield_factories(full_flush=True)) self.assertLength(2, factories) keys = [f.key for f in factories] kinds = [f.storage_kind for f in factories] self.assertEqual([("key1",), ("key2",)], keys) self.assertEqual(["groupcompress-block", "groupcompress-block"], kinds) def 
test_yield_factories_flushing(self): """yield_factories holds back on yielding results from the final block unless passed full_flush=True. """ fake_block = groupcompress.GroupCompressBlock() read_memo = ("fake index", 100, 50) gcvf = StubGCVF() gcvf._group_cache[read_memo] = fake_block locations = {("key",): (read_memo + (0, 0), None, None, None)} batcher = groupcompress._BatchingBlockFetcher(gcvf, locations) batcher.add_key(("key",)) self.assertEqual([], list(batcher.yield_factories())) factories = list(batcher.yield_factories(full_flush=True)) self.assertLength(1, factories) self.assertEqual(("key",), factories[0].key) self.assertEqual("groupcompress-block", factories[0].storage_kind) class TestLazyGroupCompress(TestCaseWithMemoryTransport): _texts = { (b"key1",): b"this is a text\n" b"with a reasonable amount of compressible bytes\n" b"which can be shared between various other texts\n", (b"key2",): b"another text\n" b"with a reasonable amount of compressible bytes\n" b"which can be shared between various other texts\n", (b"key3",): b"yet another text which won't be extracted\n" b"with a reasonable amount of compressible bytes\n" b"which can be shared between various other texts\n", (b"key4",): b"this will be extracted\n" b"but references most of its bytes from\n" b"yet another text which won't be extracted\n" b"with a reasonable amount of compressible bytes\n" b"which can be shared between various other texts\n", } def make_block(self, key_to_text): """Create a GroupCompressBlock, filling it with the given texts.""" compressor = groupcompress.GroupCompressor() for key in sorted(key_to_text): compressor.compress(key, [key_to_text[key]], len(key_to_text[key]), None) locs = { key: (start, end) for key, (start, _, end, _) in compressor.labels_deltas.items() } block = compressor.flush() raw_bytes = block.to_bytes() return locs, groupcompress.GroupCompressBlock.from_bytes(raw_bytes) def add_key_to_manager(self, key, locations, block, manager): start, end = 
locations[key] manager.add_factory(key, (), start, end) def make_block_and_full_manager(self, texts): locations, block = self.make_block(texts) manager = groupcompress._LazyGroupContentManager(block) for key in sorted(texts): self.add_key_to_manager(key, locations, block, manager) return block, manager def test_get_fulltexts(self): locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) self.add_key_to_manager((b"key1",), locations, block, manager) self.add_key_to_manager((b"key2",), locations, block, manager) result_order = [] for record in manager.get_record_stream(): result_order.append(record.key) text = self._texts[record.key] self.assertEqual(text, record.get_bytes_as("fulltext")) self.assertEqual([(b"key1",), (b"key2",)], result_order) # If we build the manager in the opposite order, we should get them # back in the opposite order manager = groupcompress._LazyGroupContentManager(block) self.add_key_to_manager((b"key2",), locations, block, manager) self.add_key_to_manager((b"key1",), locations, block, manager) result_order = [] for record in manager.get_record_stream(): result_order.append(record.key) text = self._texts[record.key] self.assertEqual(text, record.get_bytes_as("fulltext")) self.assertEqual([(b"key2",), (b"key1",)], result_order) def test__wire_bytes_no_keys(self): _locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) wire_bytes = manager._wire_bytes() block_length = len(block.to_bytes()) # We should have triggered a strip, since we aren't using any content stripped_block = manager._block.to_bytes() self.assertGreater(block_length, len(stripped_block)) empty_z_header = zlib.compress(b"") self.assertEqual( b"groupcompress-block\n" b"8\n" # len(compress('')) b"0\n" # len('') b"%d\n" # compressed block len b"%s" # zheader b"%s" % (len(stripped_block), empty_z_header, stripped_block), # block wire_bytes, ) def test__wire_bytes(self): locations, block = 
self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) self.add_key_to_manager((b"key1",), locations, block, manager) self.add_key_to_manager((b"key4",), locations, block, manager) block_bytes = block.to_bytes() wire_bytes = manager._wire_bytes() (storage_kind, z_header_len, header_len, block_len, rest) = wire_bytes.split( b"\n", 4 ) z_header_len = int(z_header_len) header_len = int(header_len) block_len = int(block_len) self.assertEqual(b"groupcompress-block", storage_kind) self.assertEqual(34, z_header_len) self.assertEqual(26, header_len) self.assertEqual(len(block_bytes), block_len) z_header = rest[:z_header_len] header = zlib.decompress(z_header) self.assertEqual(header_len, len(header)) entry1 = locations[(b"key1",)] entry4 = locations[(b"key4",)] self.assertEqualDiff( b"key1\n" b"\n" # no parents b"%d\n" # start offset b"%d\n" # end offset b"key4\n" b"\n" b"%d\n" b"%d\n" % (entry1[0], entry1[1], entry4[0], entry4[1]), header, ) z_block = rest[z_header_len:] self.assertEqual(block_bytes, z_block) def test_from_bytes(self): locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) self.add_key_to_manager((b"key1",), locations, block, manager) self.add_key_to_manager((b"key4",), locations, block, manager) wire_bytes = manager._wire_bytes() self.assertStartsWith(wire_bytes, b"groupcompress-block\n") manager = groupcompress._LazyGroupContentManager.from_bytes(wire_bytes) self.assertIsInstance(manager, groupcompress._LazyGroupContentManager) self.assertEqual(2, len(manager._factories)) self.assertEqual(block._z_content, manager._block._z_content) result_order = [] for record in manager.get_record_stream(): result_order.append(record.key) text = self._texts[record.key] self.assertEqual(text, record.get_bytes_as("fulltext")) self.assertEqual([(b"key1",), (b"key4",)], result_order) def test__check_rebuild_no_changes(self): block, manager = self.make_block_and_full_manager(self._texts) 
manager._check_rebuild_block() self.assertIs(block, manager._block) def test__check_rebuild_only_one(self): locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) # Request just the first key, which should trigger a 'strip' action self.add_key_to_manager((b"key1",), locations, block, manager) manager._check_rebuild_block() self.assertIsNot(block, manager._block) self.assertGreater(block._content_length, manager._block._content_length) # We should be able to still get the content out of this block, though # it should only have 1 entry for record in manager.get_record_stream(): self.assertEqual((b"key1",), record.key) self.assertEqual(self._texts[record.key], record.get_bytes_as("fulltext")) def test__check_rebuild_middle(self): locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) # Request a small key in the middle should trigger a 'rebuild' self.add_key_to_manager((b"key4",), locations, block, manager) manager._check_rebuild_block() self.assertIsNot(block, manager._block) self.assertGreater(block._content_length, manager._block._content_length) for record in manager.get_record_stream(): self.assertEqual((b"key4",), record.key) self.assertEqual(self._texts[record.key], record.get_bytes_as("fulltext")) def test_manager_default_compressor_settings(self): _locations, old_block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(old_block) gcvf = groupcompress.GroupCompressVersionedFiles # It doesn't greedily evaluate _max_bytes_to_index self.assertIs(None, manager._compressor_settings) self.assertEqual( gcvf._DEFAULT_COMPRESSOR_SETTINGS, manager._get_compressor_settings() ) def test_manager_custom_compressor_settings(self): _locations, old_block = self.make_block(self._texts) called = [] def compressor_settings(): called.append("called") return (10,) manager = groupcompress._LazyGroupContentManager( old_block, 
get_compressor_settings=compressor_settings ) # It doesn't greedily evaluate compressor_settings self.assertIs(None, manager._compressor_settings) self.assertEqual((10,), manager._get_compressor_settings()) self.assertEqual((10,), manager._get_compressor_settings()) self.assertEqual((10,), manager._compressor_settings) # Only called 1 time self.assertEqual(["called"], called) def test__rebuild_handles_compressor_settings(self): if not isinstance( groupcompress.GroupCompressor, groupcompress.PyrexGroupCompressor ): raise TestNotApplicable( "pure-python compressor does not handle compressor_settings" ) locations, old_block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager( old_block, get_compressor_settings=lambda: {"max_bytes_to_index": 32} ) gc = manager._make_group_compressor() self.assertEqual(32, gc._delta_index._max_bytes_to_index) self.add_key_to_manager((b"key3",), locations, old_block, manager) self.add_key_to_manager((b"key4",), locations, old_block, manager) action, _last_byte, _total_bytes = manager._check_rebuild_action() self.assertEqual("rebuild", action) manager._rebuild_block() new_block = manager._block self.assertIsNot(old_block, new_block) # Because of the new max_bytes_to_index, we do a poor job of # rebuilding. This is a side-effect of the change, but at least it does # show the setting had an effect. 
self.assertLess(old_block._content_length, new_block._content_length) def test_check_is_well_utilized_all_keys(self): block, manager = self.make_block_and_full_manager(self._texts) self.assertFalse(manager.check_is_well_utilized()) # Though we can fake it by changing the recommended minimum size manager._full_enough_block_size = block._content_length self.assertTrue(manager.check_is_well_utilized()) # Setting it just above causes it to fail manager._full_enough_block_size = block._content_length + 1 self.assertFalse(manager.check_is_well_utilized()) # Setting the mixed-block size doesn't do anything, because the content # is considered to not be 'mixed' manager._full_enough_mixed_block_size = block._content_length self.assertFalse(manager.check_is_well_utilized()) def test_check_is_well_utilized_mixed_keys(self): texts = {} f1k1 = (b"f1", b"k1") f1k2 = (b"f1", b"k2") f2k1 = (b"f2", b"k1") f2k2 = (b"f2", b"k2") texts[f1k1] = self._texts[(b"key1",)] texts[f1k2] = self._texts[(b"key2",)] texts[f2k1] = self._texts[(b"key3",)] texts[f2k2] = self._texts[(b"key4",)] block, manager = self.make_block_and_full_manager(texts) self.assertFalse(manager.check_is_well_utilized()) manager._full_enough_block_size = block._content_length self.assertTrue(manager.check_is_well_utilized()) manager._full_enough_block_size = block._content_length + 1 self.assertFalse(manager.check_is_well_utilized()) manager._full_enough_mixed_block_size = block._content_length self.assertTrue(manager.check_is_well_utilized()) def test_check_is_well_utilized_partial_use(self): locations, block = self.make_block(self._texts) manager = groupcompress._LazyGroupContentManager(block) manager._full_enough_block_size = block._content_length self.add_key_to_manager((b"key1",), locations, block, manager) self.add_key_to_manager((b"key2",), locations, block, manager) # Just using the content from key1 and 2 is not enough to be considered # 'complete' self.assertFalse(manager.check_is_well_utilized()) # However if 
we add key4, then we have enough, as we only require 75% # consumption self.add_key_to_manager((b"key4",), locations, block, manager) self.assertTrue(manager.check_is_well_utilized()) class Test_GCBuildDetails(TestCase): def test_acts_like_tuple(self): # _GCBuildDetails inlines some of the data that used to be spread out # across a bunch of tuples bd = groupcompress._GCBuildDetails( (("parent1",), ("parent2",)), ("INDEX", 10, 20, 0, 5) ) self.assertEqual(4, len(bd)) self.assertEqual(("INDEX", 10, 20, 0, 5), bd[0]) self.assertEqual(None, bd[1]) # Compression Parent is always None self.assertEqual((("parent1",), ("parent2",)), bd[2]) self.assertEqual(("group", None), bd[3]) # Record details def test__repr__(self): bd = groupcompress._GCBuildDetails( (("parent1",), ("parent2",)), ("INDEX", 10, 20, 0, 5) ) self.assertEqual( "_GCBuildDetails(('INDEX', 10, 20, 0, 5), (('parent1',), ('parent2',)))", repr(bd), ) class TestBase128Int(TestCase): def assertEqualEncode(self, bytes, val): self.assertEqual(bytes, groupcompress.encode_base128_int(val)) def assertEqualDecode(self, val, num_decode, bytes): self.assertEqual((val, num_decode), groupcompress.decode_base128_int(bytes)) def test_encode(self): self.assertEqualEncode(b"\x01", 1) self.assertEqualEncode(b"\x02", 2) self.assertEqualEncode(b"\x7f", 127) self.assertEqualEncode(b"\x80\x01", 128) self.assertEqualEncode(b"\xff\x01", 255) self.assertEqualEncode(b"\x80\x02", 256) self.assertEqualEncode(b"\xff\xff\xff\xff\x0f", 0xFFFFFFFF) def test_decode(self): self.assertEqualDecode(1, 1, b"\x01") self.assertEqualDecode(2, 1, b"\x02") self.assertEqualDecode(127, 1, b"\x7f") self.assertEqualDecode(128, 2, b"\x80\x01") self.assertEqualDecode(255, 2, b"\xff\x01") self.assertEqualDecode(256, 2, b"\x80\x02") self.assertEqualDecode(0xFFFFFFFF, 5, b"\xff\xff\xff\xff\x0f") def test_decode_with_trailing_bytes(self): self.assertEqualDecode(1, 1, b"\x01abcdef") self.assertEqualDecode(127, 1, b"\x7f\x01") self.assertEqualDecode(128, 2, 
b"\x80\x01abcdef") self.assertEqualDecode(255, 2, b"\xff\x01\xff") bzrformats_3.4.0.orig/bzrformats/tests/test_hashcache.py0000644000000000000000000001006315162115103020524 0ustar00# Copyright (C) 2005-2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA import os import time from bzrformats import osutils from .. import hashcache from . 
import TestCaseInTempDir sha1 = osutils.sha_string def pause(): time.sleep(5.0) class TestHashCache(TestCaseInTempDir): """Test the hashcache against a real directory.""" def make_hashcache(self): # make a dummy bzr directory just to hold the cache os.mkdir(".bzr") hc = hashcache.HashCache(".", ".bzr/stat-cache") return hc def reopen_hashcache(self): hc = hashcache.HashCache(".", ".bzr/stat-cache") hc.read() return hc def test_hashcache_initial_miss(self): """Get correct hash from an empty hashcache.""" hc = self.make_hashcache() self.build_tree_contents([("foo", b"hello")]) self.assertEqual( hc.get_sha1("foo"), b"aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d" ) self.assertEqual(hc.miss_count, 1) self.assertEqual(hc.hit_count, 0) def test_hashcache_new_file(self): hc = self.make_hashcache() self.build_tree_contents([("foo", b"goodbye")]) # now read without pausing; it may not be possible to cache it as it's # so new self.assertEqual(hc.get_sha1("foo"), sha1(b"goodbye")) def test_hashcache_nonexistent_file(self): hc = self.make_hashcache() self.assertEqual(hc.get_sha1("no-name-yet"), None) def test_hashcache_replaced_file(self): hc = self.make_hashcache() self.build_tree_contents([("foo", b"goodbye")]) self.assertEqual(hc.get_sha1("foo"), sha1(b"goodbye")) os.remove("foo") self.assertEqual(hc.get_sha1("foo"), None) self.build_tree_contents([("foo", b"new content")]) self.assertEqual(hc.get_sha1("foo"), sha1(b"new content")) def test_hashcache_not_file(self): hc = self.make_hashcache() self.build_tree(["subdir/"]) self.assertEqual(hc.get_sha1("subdir"), None) def test_hashcache_load(self): hc = self.make_hashcache() self.build_tree_contents([("foo", b"contents")]) pause() self.assertEqual(hc.get_sha1("foo"), sha1(b"contents")) hc.write() hc = self.reopen_hashcache() self.assertEqual(hc.get_sha1("foo"), sha1(b"contents")) self.assertEqual(hc.hit_count, 1) def test_hammer_hashcache(self): hc = self.make_hashcache() for i in range(10000): with open("foo", "wb") as f: 
last_content = b"%08x" % i f.write(last_content) last_sha1 = sha1(last_content) self.log("iteration %d: %r -> %r", i, last_content, last_sha1) got_sha1 = hc.get_sha1("foo") self.assertEqual(got_sha1, last_sha1) hc.write() hc = self.reopen_hashcache() def test_hashcache_raise(self): """Check that hashcache raises OSError when asked to hash a fifo.""" if getattr(os, "mkfifo", None) is None: self.skipTest("os.mkfifo not available") hc = self.make_hashcache() os.mkfifo("a") # It's possible that the system supports fifos but the filesystem # can't. In that case we should skip at this point. But in fact # such combinations don't usually occur for the filesystem where # people test bzr. self.assertRaises(OSError, hc.get_sha1, "a") bzrformats_3.4.0.orig/bzrformats/tests/test_index.py0000644000000000000000000027144515162115107017745 0ustar00# Copyright (C) 2007-2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for indices.""" from .. import index as _mod_index from ..transport import TracingTransport, TransportNoSuchFile from . 
import TestCase, TestCaseWithMemoryTransport class ErrorTests(TestCase): """Tests for index error classes.""" def test_bad_index_format_signature(self): """Test bad index format signature.""" error = _mod_index.BadIndexFormatSignature("foo", "bar") self.assertEqual("foo is not an index of type bar.", str(error)) def test_bad_index_data(self): """Test bad index data.""" error = _mod_index.BadIndexData("foo") self.assertEqual("Error in data for index foo.", str(error)) def test_bad_index_duplicate_key(self): """Test bad index duplicate key.""" error = _mod_index.BadIndexDuplicateKey("foo", "bar") self.assertEqual("The key 'foo' is already in index 'bar'.", str(error)) def test_bad_index_key(self): """Test bad index key.""" error = _mod_index.BadIndexKey("foo") self.assertEqual("The key 'foo' is not a valid key.", str(error)) def test_bad_index_options(self): """Test bad index options.""" error = _mod_index.BadIndexOptions("foo") self.assertEqual("Could not parse options for index foo.", str(error)) def test_bad_index_value(self): """Test bad index value.""" error = _mod_index.BadIndexValue("foo") self.assertEqual("The value 'foo' is not a valid value.", str(error)) class TestGraphIndexBuilder(TestCaseWithMemoryTransport): """Tests for Graph Index Builder.""" def test_build_index_empty(self): """Test build index empty.""" builder = _mod_index.GraphIndexBuilder() stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=1\nlen=0\n\n", contents, ) def test_build_index_empty_two_element_keys(self): """Test build index empty two element keys.""" builder = _mod_index.GraphIndexBuilder(key_elements=2) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=2\nlen=0\n\n", contents, ) def test_build_index_one_reference_list_empty(self): """Test build index one reference list empty.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) 
stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=0\n\n", contents, ) def test_build_index_two_reference_list_empty(self): """Test build index two reference list empty.""" builder = _mod_index.GraphIndexBuilder(reference_lists=2) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=2\nkey_elements=1\nlen=0\n\n", contents, ) def test_build_index_one_node_no_refs(self): """Test build index one node no refs.""" builder = _mod_index.GraphIndexBuilder() builder.add_node((b"akey",), b"data") stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=1\nlen=1\n" b"akey\x00\x00\x00data\n\n", contents, ) def test_build_index_one_node_no_refs_accepts_empty_reflist(self): """Test build index one node no refs accepts empty reflist.""" builder = _mod_index.GraphIndexBuilder() builder.add_node((b"akey",), b"data", ()) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=1\nlen=1\n" b"akey\x00\x00\x00data\n\n", contents, ) def test_build_index_one_node_2_element_keys(self): """Test build index one node 2 element keys.""" # multipart keys are separated by \x00 - because they are fixed length, # not variable this does not cause any issues, and seems clearer to the # author. 
builder = _mod_index.GraphIndexBuilder(key_elements=2) builder.add_node((b"akey", b"secondpart"), b"data") stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=2\nlen=1\n" b"akey\x00secondpart\x00\x00\x00data\n\n", contents, ) def test_add_node_empty_value(self): """Test add node empty value.""" builder = _mod_index.GraphIndexBuilder() builder.add_node((b"akey",), b"") stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=1\nlen=1\n" b"akey\x00\x00\x00\n\n", contents, ) def test_build_index_nodes_sorted(self): """Test build index nodes sorted.""" # nodes are written out in ascending key order. builder = _mod_index.GraphIndexBuilder() # use three to have a good chance of glitching dictionary hash # lookups etc. Insert in randomish order that is not correct # and not the reverse of the correct order. builder.add_node((b"2002",), b"data") builder.add_node((b"2000",), b"data") builder.add_node((b"2001",), b"data") stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=1\nlen=3\n" b"2000\x00\x00\x00data\n" b"2001\x00\x00\x00data\n" b"2002\x00\x00\x00data\n" b"\n", contents, ) def test_build_index_2_element_key_nodes_sorted(self): """Test build index 2 element key nodes sorted.""" # multiple element keys are sorted first-key, second-key. builder = _mod_index.GraphIndexBuilder(key_elements=2) # use three values of each key element, to have a good chance of # glitching dictionary hash lookups etc. Insert in randomish order that # is not correct and not the reverse of the correct order. 
builder.add_node((b"2002", b"2002"), b"data") builder.add_node((b"2002", b"2000"), b"data") builder.add_node((b"2002", b"2001"), b"data") builder.add_node((b"2000", b"2002"), b"data") builder.add_node((b"2000", b"2000"), b"data") builder.add_node((b"2000", b"2001"), b"data") builder.add_node((b"2001", b"2002"), b"data") builder.add_node((b"2001", b"2000"), b"data") builder.add_node((b"2001", b"2001"), b"data") stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=0\nkey_elements=2\nlen=9\n" b"2000\x002000\x00\x00\x00data\n" b"2000\x002001\x00\x00\x00data\n" b"2000\x002002\x00\x00\x00data\n" b"2001\x002000\x00\x00\x00data\n" b"2001\x002001\x00\x00\x00data\n" b"2001\x002002\x00\x00\x00data\n" b"2002\x002000\x00\x00\x00data\n" b"2002\x002001\x00\x00\x00data\n" b"2002\x002002\x00\x00\x00data\n" b"\n", contents, ) def test_build_index_reference_lists_are_included_one(self): """Test build index reference lists are included one.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) builder.add_node((b"key",), b"data", ([],)) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=1\n" b"key\x00\x00\x00data\n" b"\n", contents, ) def test_build_index_reference_lists_with_2_element_keys(self): """Test build index reference lists with 2 element keys.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1, key_elements=2) builder.add_node((b"key", b"key2"), b"data", ([],)) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=2\nlen=1\n" b"key\x00key2\x00\x00\x00data\n" b"\n", contents, ) def test_build_index_reference_lists_are_included_two(self): """Test build index reference lists are included two.""" builder = _mod_index.GraphIndexBuilder(reference_lists=2) builder.add_node((b"key",), b"data", ([], [])) stream = builder.finish() contents = stream.read() 
self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=2\nkey_elements=1\nlen=1\n" b"key\x00\x00\t\x00data\n" b"\n", contents, ) def test_clear_cache(self): """Test clear cache.""" builder = _mod_index.GraphIndexBuilder(reference_lists=2) # This is a no-op, but the api should exist builder.clear_cache() def test_node_references_are_byte_offsets(self): """Test node references are byte offsets.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) builder.add_node((b"reference",), b"data", ([],)) builder.add_node((b"key",), b"data", ([(b"reference",)],)) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=2\n" b"key\x00\x0072\x00data\n" b"reference\x00\x00\x00data\n" b"\n", contents, ) def test_node_references_are_cr_delimited(self): """Test node references are cr delimited.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) builder.add_node((b"reference",), b"data", ([],)) builder.add_node((b"reference2",), b"data", ([],)) builder.add_node((b"key",), b"data", ([(b"reference",), (b"reference2",)],)) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=3\n" b"key\x00\x00077\r094\x00data\n" b"reference\x00\x00\x00data\n" b"reference2\x00\x00\x00data\n" b"\n", contents, ) def test_multiple_reference_lists_are_tab_delimited(self): """Test multiple reference lists are tab delimited.""" builder = _mod_index.GraphIndexBuilder(reference_lists=2) builder.add_node((b"keference",), b"data", ([], [])) builder.add_node((b"rey",), b"data", ([(b"keference",)], [(b"keference",)])) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=2\nkey_elements=1\nlen=2\n" b"keference\x00\x00\t\x00data\n" b"rey\x00\x0059\t59\x00data\n" b"\n", contents, ) def test_add_node_referencing_missing_key_makes_absent(self): """Test add node referencing missing key makes 
absent.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) builder.add_node((b"rey",), b"data", ([(b"beference",), (b"aeference2",)],)) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=1\n" b"aeference2\x00a\x00\x00\n" b"beference\x00a\x00\x00\n" b"rey\x00\x00074\r059\x00data\n" b"\n", contents, ) def test_node_references_three_digits(self): """Test node references three digits.""" # test the node digit expands as needed. builder = _mod_index.GraphIndexBuilder(reference_lists=1) references = [((b"%d" % val),) for val in range(8, -1, -1)] builder.add_node((b"2-key",), b"", (references,)) stream = builder.finish() contents = stream.read() self.assertEqualDiff( b"Bazaar Graph Index 1\nnode_ref_lists=1\nkey_elements=1\nlen=1\n" b"0\x00a\x00\x00\n" b"1\x00a\x00\x00\n" b"2\x00a\x00\x00\n" b"2-key\x00\x00151\r145\r139\r133\r127\r121\r071\r065\r059\x00\n" b"3\x00a\x00\x00\n" b"4\x00a\x00\x00\n" b"5\x00a\x00\x00\n" b"6\x00a\x00\x00\n" b"7\x00a\x00\x00\n" b"8\x00a\x00\x00\n" b"\n", contents, ) def test_absent_has_no_reference_overhead(self): """Test absent has no reference overhead.""" # the offsets after an absent record should be correct when there are # >1 reference lists. 
builder = _mod_index.GraphIndexBuilder(reference_lists=2) builder.add_node((b"parent",), b"", ([(b"aail",), (b"zther",)], [])) stream = builder.finish() contents = stream.read() self.assertEqual( b"Bazaar Graph Index 1\nnode_ref_lists=2\nkey_elements=1\nlen=1\n" b"aail\x00a\x00\x00\n" b"parent\x00\x0059\r84\t\x00\n" b"zther\x00a\x00\x00\n" b"\n", contents, ) def test_add_node_bad_key(self): """Test add node bad key.""" builder = _mod_index.GraphIndexBuilder() for bad_char in bytearray(b"\t\n\x0b\x0c\r\x00 "): self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"a%skey" % bytes([bad_char]),), b"data", ) self.assertRaises(_mod_index.BadIndexKey, builder.add_node, (), b"data") self.assertRaises( _mod_index.BadIndexKey, builder.add_node, b"not-a-tuple", b"data" ) # not enough length self.assertRaises(_mod_index.BadIndexKey, builder.add_node, (), b"data") # too long self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"primary", b"secondary"), b"data", ) # secondary key elements get checked too: builder = _mod_index.GraphIndexBuilder(key_elements=2) for bad_char in bytearray(b"\t\n\x0b\x0c\r\x00 "): self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"prefix", b"a%skey" % bytes([bad_char])), b"data", ) def test_add_node_bad_data(self): """Test add node bad data.""" builder = _mod_index.GraphIndexBuilder() self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data\naa" ) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data\x00aa" ) def test_add_node_bad_mismatched_ref_lists_length(self): """Test add node bad mismatched ref lists length.""" builder = _mod_index.GraphIndexBuilder() self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa", ([],) ) builder = _mod_index.GraphIndexBuilder(reference_lists=1) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa" ) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, 
(b"akey",), b"data aa", (), ) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa", ([], []) ) builder = _mod_index.GraphIndexBuilder(reference_lists=2) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa" ) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa", ([],) ) self.assertRaises( _mod_index.BadIndexValue, builder.add_node, (b"akey",), b"data aa", ([], [], []), ) def test_add_node_bad_key_in_reference_lists(self): """Test add node bad key in reference lists.""" # first list, first key - trivial builder = _mod_index.GraphIndexBuilder(reference_lists=1) self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", ([(b"a key",)],), ) # references keys must be tuples too self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", (["not-a-tuple"],), ) # not enough length self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", ([()],) ) # too long self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", ([(b"primary", b"secondary")],), ) # need to check more than the first key in the list self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", ([(b"agoodkey",), (b"that is a bad key",)],), ) # and if there is more than one list it should be getting checked # too builder = _mod_index.GraphIndexBuilder(reference_lists=2) self.assertRaises( _mod_index.BadIndexKey, builder.add_node, (b"akey",), b"data aa", ([], ["a bad key"]), ) def test_add_duplicate_key(self): """Test add duplicate key.""" builder = _mod_index.GraphIndexBuilder() builder.add_node((b"key",), b"data") self.assertRaises( _mod_index.BadIndexDuplicateKey, builder.add_node, (b"key",), b"data" ) def test_add_duplicate_key_2_elements(self): """Test add duplicate key 2 elements.""" builder = _mod_index.GraphIndexBuilder(key_elements=2) builder.add_node((b"key", b"key"), b"data") 
self.assertRaises( _mod_index.BadIndexDuplicateKey, builder.add_node, (b"key", b"key"), b"data" ) def test_add_key_after_referencing_key(self): """Test add key after referencing key.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1) builder.add_node((b"key",), b"data", ([(b"reference",)],)) builder.add_node((b"reference",), b"data", ([],)) def test_add_key_after_referencing_key_2_elements(self): """Test add key after referencing key 2 elements.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1, key_elements=2) builder.add_node((b"k", b"ey"), b"data", ([(b"reference", b"tokey")],)) builder.add_node((b"reference", b"tokey"), b"data", ([],)) def test_set_optimize(self): """Test set optimize.""" builder = _mod_index.GraphIndexBuilder(reference_lists=1, key_elements=2) builder.set_optimize(for_size=True) self.assertTrue(builder._optimize_for_size) builder.set_optimize(for_size=False) self.assertFalse(builder._optimize_for_size) class TestGraphIndex(TestCaseWithMemoryTransport): """Tests for Graph Index.""" def make_key(self, number): """Make key.""" return ((b"%d" % number) + b"X" * 100,) def make_value(self, number): """Make value.""" return (b"%d" % number) + b"Y" * 100 def make_nodes(self, count=64): """Make nodes.""" # generate a big enough index that we only read some of it on a typical # bisection lookup. 
nodes = [] for counter in range(count): nodes.append((self.make_key(counter), self.make_value(counter), ())) return nodes def make_index(self, ref_lists=0, key_elements=1, nodes=None): """Make index.""" if nodes is None: nodes = [] builder = _mod_index.GraphIndexBuilder(ref_lists, key_elements=key_elements) for key, value, references in nodes: builder.add_node(key, value, references) stream = builder.finish() trans = TracingTransport(self.get_transport()) size = trans.put_file("index", stream) return _mod_index.GraphIndex(trans, "index", size) def make_index_with_offset(self, ref_lists=0, key_elements=1, nodes=None, offset=0): """Make index with offset.""" if nodes is None: nodes = [] builder = _mod_index.GraphIndexBuilder(ref_lists, key_elements=key_elements) for key, value, references in nodes: builder.add_node(key, value, references) content = builder.finish().read() size = len(content) trans = self.get_transport() trans.put_bytes("index", (b" " * offset) + content) return _mod_index.GraphIndex(trans, "index", size, offset=offset) def test_clear_cache(self): """Test clear cache.""" index = self.make_index() # For now, we just want to make sure the api is available. As this is # old code, we don't really worry if it *does* anything. 
index.clear_cache() def test_open_bad_index_no_error(self): """Test open bad index no error.""" trans = self.get_transport() trans.put_bytes("name", b"not an index\n") _mod_index.GraphIndex(trans, "name", 13) def test_with_offset(self): """Test with offset.""" nodes = self.make_nodes(200) idx = self.make_index_with_offset(offset=1234567, nodes=nodes) self.assertEqual(200, idx.key_count()) def test_buffer_all_with_offset(self): """Test buffer all with offset.""" nodes = self.make_nodes(200) idx = self.make_index_with_offset(offset=1234567, nodes=nodes) idx._buffer_all() self.assertEqual(200, idx.key_count()) def test_side_effect_buffering_with_offset(self): """Test side effect buffering with offset.""" nodes = self.make_nodes(20) index = self.make_index_with_offset(offset=1234567, nodes=nodes) index._transport.recommended_page_size = lambda: 64 * 1024 subset_nodes = [nodes[0][0], nodes[10][0], nodes[19][0]] entries = [n[1] for n in index.iter_entries(subset_nodes)] self.assertEqual(sorted(subset_nodes), sorted(entries)) self.assertEqual(20, index.key_count()) def test_open_sets_parsed_map_empty(self): """Test open sets parsed map empty.""" index = self.make_index() self.assertEqual([], index._parsed_byte_map) self.assertEqual([], index._parsed_key_map) def test_key_count_buffers(self): """Test key count buffers.""" index = self.make_index(nodes=self.make_nodes(2)) # reset the transport log del index._transport._activity[:] self.assertEqual(2, index.key_count()) # We should have requested reading the header bytes self.assertEqual( [ ("readv", "index", [(0, 200)], True, index._size), ], index._transport._activity, ) # And that should have been enough to trigger reading the whole index # with buffering self.assertIsNot(None, index._nodes) def test_lookup_key_via_location_buffers(self): """Test lookup key via location buffers.""" index = self.make_index() # reset the transport log del index._transport._activity[:] # do a _lookup_keys_via_location call for the middle of 
the file, which # is what bisection uses. result = index._lookup_keys_via_location([(index._size // 2, (b"missing",))]) # this should have asked for a readv request, with adjust_for_latency, # and two regions: the header, and half-way into the file. self.assertEqual( [ ("readv", "index", [(30, 30), (0, 200)], True, 60), ], index._transport._activity, ) # and the result should be that the key cannot be present, because this # is a trivial index. self.assertEqual([((index._size // 2, (b"missing",)), False)], result) # And this should have caused the file to be fully buffered self.assertIsNot(None, index._nodes) self.assertEqual([], index._parsed_byte_map) def test_first_lookup_key_via_location(self): """Test first lookup key via location.""" # We need enough data so that the _HEADER_READV doesn't consume the # whole file. We always read 800 bytes for every key, and the local # transport natural expansion is 4096 bytes. So we have to have >8192 # bytes or we will trigger "buffer_all". # We also want the 'missing' key to fall within the range that *did* # read index = self.make_index(nodes=self.make_nodes(64)) # reset the transport log del index._transport._activity[:] # do a _lookup_keys_via_location call for the middle of the file, which # is what bisection uses. start_lookup = index._size // 2 result = index._lookup_keys_via_location([(start_lookup, (b"40missing",))]) # this should have asked for a readv request, with adjust_for_latency, # and two regions: the header, and half-way into the file. self.assertEqual( [ ("readv", "index", [(start_lookup, 800), (0, 200)], True, index._size), ], index._transport._activity, ) # and the result should be that the key cannot be present, because this # is a trivial index. 
self.assertEqual([((start_lookup, (b"40missing",)), False)], result) # And this should not have caused the file to be fully buffered self.assertIs(None, index._nodes) # And the regions of the file that have been parsed should be in the # parsed_byte_map and the parsed_key_map self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(26)), (self.make_key(31), self.make_key(48))], index._parsed_key_map, ) def test_parsing_non_adjacent_data_trims(self): """Test parsing non adjacent data trims.""" index = self.make_index(nodes=self.make_nodes(64)) result = index._lookup_keys_via_location([(index._size // 2, (b"40",))]) # and the result should be that the key cannot be present, because the key is # in the middle of the observed data from a 4K read - the smallest transport # will do today with this api. self.assertEqual([((index._size // 2, (b"40",)), False)], result) # and we should have a parse map that includes the header and the # region that was parsed after trimming. self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(26)), (self.make_key(31), self.make_key(48))], index._parsed_key_map, ) def test_parsing_data_handles_parsed_contained_regions(self): """Test parsing data handles parsed contained regions.""" # the following pattern creates a parsed region that is wholly within a # single result from the readv layer: # .... single-read (readv-minimum-size) ... # which then trims the start and end so the parsed size is < readv # minimum. # then a dual lookup (or a reference lookup for that matter) which # abuts or overlaps the parsed region on both sides will need to # discard the data in the middle, but parse the end as well. # # we test this by doing a single lookup to seed the data, then # a lookup for two keys that are present, and adjacent - # we expect both to be found, and the parsed byte map to include the # locations of both keys. 
index = self.make_index(nodes=self.make_nodes(128)) result = index._lookup_keys_via_location([(index._size // 2, (b"40",))]) # and we should have a parse map that includes the header and the # region that was parsed after trimming. self.assertEqual([(0, 4045), (11759, 15707)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(116)), (self.make_key(35), self.make_key(51))], index._parsed_key_map, ) # now ask for two keys, right before and after the parsed region result = index._lookup_keys_via_location( [(11450, self.make_key(34)), (15707, self.make_key(52))] ) self.assertEqual( [ ( (11450, self.make_key(34)), (index, self.make_key(34), self.make_value(34)), ), ( (15707, self.make_key(52)), (index, self.make_key(52), self.make_value(52)), ), ], result, ) self.assertEqual([(0, 4045), (9889, 17993)], index._parsed_byte_map) def test_lookup_missing_key_answers_without_io_when_map_permits(self): """Test lookup missing key answers without io when map permits.""" # generate a big enough index that we only read some of it on a typical # bisection lookup. 
index = self.make_index(nodes=self.make_nodes(64)) # lookup the keys in the middle of the file result = index._lookup_keys_via_location([(index._size // 2, (b"40",))]) # check the parse map, this determines the test validity self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(26)), (self.make_key(31), self.make_key(48))], index._parsed_key_map, ) # reset the transport log del index._transport._activity[:] # now looking up a key in the portion of the file already parsed should # not create a new transport request, and should return False (cannot # be in the index) - even when the byte location we ask for is outside # the parsed region result = index._lookup_keys_via_location([(4000, (b"40",))]) self.assertEqual([((4000, (b"40",)), False)], result) self.assertEqual([], index._transport._activity) def test_lookup_present_key_answers_without_io_when_map_permits(self): """Test lookup present key answers without io when map permits.""" # generate a big enough index that we only read some of it on a typical # bisection lookup. 
index = self.make_index(nodes=self.make_nodes(64)) # lookup the keys in the middle of the file result = index._lookup_keys_via_location([(index._size // 2, (b"40",))]) # check the parse map, this determines the test validity self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(26)), (self.make_key(31), self.make_key(48))], index._parsed_key_map, ) # reset the transport log del index._transport._activity[:] # now looking up a key in the portion of the file already parsed should # not create a new transport request, and should return the entry for # that key - even when the byte location we ask for is outside # the parsed region result = index._lookup_keys_via_location([(4000, self.make_key(40))]) self.assertEqual( [ ( (4000, self.make_key(40)), (index, self.make_key(40), self.make_value(40)), ) ], result, ) self.assertEqual([], index._transport._activity) def test_lookup_key_below_probed_area(self): """Test lookup key below probed area.""" # generate a big enough index that we only read some of it on a typical # bisection lookup. index = self.make_index(nodes=self.make_nodes(64)) # ask for the key in the middle, but a key that is located in the # unparsed region before the middle. 
result = index._lookup_keys_via_location([(index._size // 2, (b"50",))]) # check the parse map, this determines the test validity self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(26)), (self.make_key(31), self.make_key(48))], index._parsed_key_map, ) self.assertEqual([((index._size // 2, (b"50",)), +1)], result) def test_lookup_key_resolves_references(self): """Test lookup key resolves references.""" # generate a big enough index that we only read some of it on a typical # bisection lookup. nodes = [] for counter in range(99): nodes.append( ( self.make_key(counter), self.make_value(counter), ((self.make_key(counter + 20),),), ) ) index = self.make_index(ref_lists=1, nodes=nodes) # lookup a key in the middle that does not exist, so that we can # check that the referred-to-keys are not accessed automatically. index_size = index._size index_center = index_size // 2 result = index._lookup_keys_via_location([(index_center, (b"40",))]) # check the parse map - only the start and middle should have been # parsed. self.assertEqual([(0, 4027), (10198, 14028)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(17)), (self.make_key(44), self.make_key(5))], index._parsed_key_map, ) # and check the transport activity likewise. self.assertEqual( [("readv", "index", [(index_center, 800), (0, 200)], True, index_size)], index._transport._activity, ) # reset the transport log for testing the reference lookup del index._transport._activity[:] # now looking up a key in the portion of the file already parsed should # only perform IO to resolve its key references. 
result = index._lookup_keys_via_location([(11000, self.make_key(45))]) self.assertEqual( [ ( (11000, self.make_key(45)), ( index, self.make_key(45), self.make_value(45), ((self.make_key(65),),), ), ) ], result, ) self.assertEqual( [("readv", "index", [(15093, 800)], True, index_size)], index._transport._activity, ) def test_lookup_key_can_buffer_all(self): """Test lookup key can buffer all.""" nodes = [] for counter in range(64): nodes.append( ( self.make_key(counter), self.make_value(counter), ((self.make_key(counter + 20),),), ) ) index = self.make_index(ref_lists=1, nodes=nodes) # lookup a key in the middle that does not exist, so that we can # check that the referred-to-keys are not accessed automatically. index_size = index._size index_center = index_size // 2 result = index._lookup_keys_via_location([(index_center, (b"40",))]) # check the parse map - only the start and middle should have been # parsed. self.assertEqual([(0, 3890), (6444, 10274)], index._parsed_byte_map) self.assertEqual( [((), self.make_key(25)), (self.make_key(37), self.make_key(52))], index._parsed_key_map, ) # and check the transport activity likewise. self.assertEqual( [("readv", "index", [(index_center, 800), (0, 200)], True, index_size)], index._transport._activity, ) # reset the transport log for testing the reference lookup del index._transport._activity[:] # now looking up a key in the portion of the file already parsed should # only perform IO to resolve its key references. 
result = index._lookup_keys_via_location([(7000, self.make_key(40))]) self.assertEqual( [ ( (7000, self.make_key(40)), ( index, self.make_key(40), self.make_value(40), ((self.make_key(60),),), ), ) ], result, ) # Resolving the references would have required more data read, and we # are already above the 50% threshold, so it triggered a _buffer_all self.assertEqual([("get", "index")], index._transport._activity) def test_iter_all_entries_empty(self): """Test iter all entries empty.""" index = self.make_index() self.assertEqual([], list(index.iter_all_entries())) def test_iter_all_entries_simple(self): """Test iter all entries simple.""" index = self.make_index(nodes=[((b"name",), b"data", ())]) self.assertEqual([(index, (b"name",), b"data")], list(index.iter_all_entries())) def test_iter_all_entries_simple_2_elements(self): """Test iter all entries simple 2 elements.""" index = self.make_index( key_elements=2, nodes=[((b"name", b"surname"), b"data", ())] ) self.assertEqual( [(index, (b"name", b"surname"), b"data")], list(index.iter_all_entries()) ) def test_iter_all_entries_references_resolved(self): """Test iter all entries references resolved.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_all_entries()), ) def test_iter_entries_buffers_once(self): """Test iter entries buffers once.""" index = self.make_index(nodes=self.make_nodes(2)) # reset the transport log del index._transport._activity[:] self.assertEqual( {(index, self.make_key(1), self.make_value(1))}, set(index.iter_entries([self.make_key(1)])), ) # We should have requested reading the header bytes # But not needed any more than that because it would have triggered a # buffer all self.assertEqual( [ ("readv", "index", [(0, 200)], True, index._size), ], index._transport._activity, ) # And that should have been enough 
to trigger reading the whole index # with buffering self.assertIsNot(None, index._nodes) def test_iter_entries_buffers_by_bytes_read(self): """Test iter entries buffers by bytes read.""" index = self.make_index(nodes=self.make_nodes(64)) list(index.iter_entries([self.make_key(10)])) # The first time through isn't enough to trigger a buffer all self.assertIs(None, index._nodes) self.assertEqual(4096, index._bytes_read) # Grabbing a key in that same page won't trigger a buffer all, as we # still haven't read 50% of the file list(index.iter_entries([self.make_key(11)])) self.assertIs(None, index._nodes) self.assertEqual(4096, index._bytes_read) # We haven't read more data, so reading outside the range won't trigger # a buffer all right away list(index.iter_entries([self.make_key(40)])) self.assertIs(None, index._nodes) self.assertEqual(8192, index._bytes_read) # On the next pass, we will not trigger buffer all if the key is # available without reading more list(index.iter_entries([self.make_key(32)])) self.assertIs(None, index._nodes) # But if we *would* need to read more to resolve it, then we will # buffer all. 
list(index.iter_entries([self.make_key(60)])) self.assertIsNot(None, index._nodes) def test_iter_entries_references_resolved(self): """Test iter entries references resolved.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",), (b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",), (b"ref",)),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_entries([(b"name",), (b"ref",)])), ) def test_iter_entries_references_2_refs_resolved(self): """Test iter entries references 2 refs resolved.""" index = self.make_index( 2, nodes=[ ((b"name",), b"data", ([(b"ref",)], [(b"ref",)])), ((b"ref",), b"refdata", ([], [])), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),), ((b"ref",),))), (index, (b"ref",), b"refdata", ((), ())), }, set(index.iter_entries([(b"name",), (b"ref",)])), ) def test_iteration_absent_skipped(self): """Test iteration absent skipped.""" index = self.make_index(1, nodes=[((b"name",), b"data", ([(b"ref",)],))]) self.assertEqual( {(index, (b"name",), b"data", (((b"ref",),),))}, set(index.iter_all_entries()), ) self.assertEqual( {(index, (b"name",), b"data", (((b"ref",),),))}, set(index.iter_entries([(b"name",)])), ) self.assertEqual([], list(index.iter_entries([(b"ref",)]))) def test_iteration_absent_skipped_2_element_keys(self): """Test iteration absent skipped 2 element keys.""" index = self.make_index( 1, key_elements=2, nodes=[((b"name", b"fin"), b"data", ([(b"ref", b"erence")],))], ) self.assertEqual( [(index, (b"name", b"fin"), b"data", (((b"ref", b"erence"),),))], list(index.iter_all_entries()), ) self.assertEqual( [(index, (b"name", b"fin"), b"data", (((b"ref", b"erence"),),))], list(index.iter_entries([(b"name", b"fin")])), ) self.assertEqual([], list(index.iter_entries([(b"ref", b"erence")]))) def test_iter_all_keys(self): """Test iter all keys.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", 
([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_entries([(b"name",), (b"ref",)])), ) def test_iter_nothing_empty(self): """Test iter nothing empty.""" index = self.make_index() self.assertEqual([], list(index.iter_entries([]))) def test_iter_missing_entry_empty(self): """Test iter missing entry empty.""" index = self.make_index() self.assertEqual([], list(index.iter_entries([(b"a",)]))) def test_iter_missing_entry_empty_no_size(self): """Test iter missing entry empty no size.""" idx = self.make_index() idx = _mod_index.GraphIndex(idx._transport, "index", None) self.assertEqual([], list(idx.iter_entries([(b"a",)]))) def test_iter_key_prefix_1_element_key_None(self): """Test iter key prefix 1 element key None.""" index = self.make_index() self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(None,)]) ) def test_iter_key_prefix_wrong_length(self): """Test iter key prefix wrong length.""" index = self.make_index() self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo", None)]) ) index = self.make_index(key_elements=2) self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo",)]) ) self.assertRaises( _mod_index.BadIndexKey, list, index.iter_entries_prefix([(b"foo", None, None)]), ) def test_iter_key_prefix_1_key_element_no_refs(self): """Test iter key prefix 1 key element no refs.""" index = self.make_index( nodes=[((b"name",), b"data", ()), ((b"ref",), b"refdata", ())] ) self.assertEqual( {(index, (b"name",), b"data"), (index, (b"ref",), b"refdata")}, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_1_key_element_refs(self): """Test iter key prefix 1 key element refs.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), 
b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_2_key_element_no_refs(self): """Test iter key prefix 2 key element no refs.""" index = self.make_index( key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ()), ((b"name", b"fin2"), b"beta", ()), ((b"ref", b"erence"), b"refdata", ()), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data"), (index, (b"ref", b"erence"), b"refdata"), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", b"fin1"), b"data"), (index, (b"name", b"fin2"), b"beta"), }, set(index.iter_entries_prefix([(b"name", None)])), ) def test_iter_key_prefix_2_key_element_refs(self): """Test iter key prefix 2 key element refs.""" index = self.make_index( 1, key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ([(b"ref", b"erence")],)), ((b"name", b"fin2"), b"beta", ([],)), ((b"ref", b"erence"), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"ref", b"erence"), b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"name", b"fin2"), b"beta", ((),)), }, set(index.iter_entries_prefix([(b"name", None)])), ) def test_key_count_empty(self): """Test key count empty.""" index = self.make_index() self.assertEqual(0, index.key_count()) def test_key_count_one(self): """Test key count one.""" index = self.make_index(nodes=[((b"name",), b"", ())]) self.assertEqual(1, index.key_count()) def test_key_count_two(self): """Test key count two.""" index = self.make_index(nodes=[((b"name",), b"", ()), ((b"foo",), b"", ())]) self.assertEqual(2, index.key_count()) def test_read_and_parse_tracks_real_read_value(self): """Test read and parse tracks real read value.""" index = self.make_index(nodes=self.make_nodes(10)) del 
index._transport._activity[:] index._read_and_parse([(0, 200)]) self.assertEqual( [ ("readv", "index", [(0, 200)], True, index._size), ], index._transport._activity, ) # The readv expansion code will expand the initial request to 4096 # bytes, which is more than enough to read the entire index, and we # will track the fact that we read that many bytes. self.assertEqual(index._size, index._bytes_read) def test_read_and_parse_triggers_buffer_all(self): """Test read and parse triggers buffer all.""" index = self.make_index( key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ()), ((b"name", b"fin2"), b"beta", ()), ((b"ref", b"erence"), b"refdata", ()), ], ) self.assertGreater(index._size, 0) self.assertIs(None, index._nodes) index._read_and_parse([(0, index._size)]) self.assertIsNot(None, index._nodes) def test_validate_bad_index_errors(self): """Test validate bad index errors.""" trans = self.get_transport() trans.put_bytes("name", b"not an index\n") idx = _mod_index.GraphIndex(trans, "name", 13) self.assertRaises(_mod_index.BadIndexFormatSignature, idx.validate) def test_validate_bad_node_refs(self): """Test validate bad node refs.""" idx = self.make_index(2) trans = self.get_transport() content = trans.get_bytes("index") # change the options line to end with b"a" rather than a parseable number new_content = content[:-2] + b"a\n\n" trans.put_bytes("index", new_content) self.assertRaises(_mod_index.BadIndexOptions, idx.validate) def test_validate_missing_end_line_empty(self): """Test validate missing end line empty.""" index = self.make_index(2) trans = self.get_transport() content = trans.get_bytes("index") # truncate the last byte trans.put_bytes("index", content[:-1]) self.assertRaises(_mod_index.BadIndexData, index.validate) def test_validate_missing_end_line_nonempty(self): """Test validate missing end line nonempty.""" index = self.make_index(2, nodes=[((b"key",), b"", ([], []))]) trans = self.get_transport() content = trans.get_bytes("index") # truncate the 
last byte trans.put_bytes("index", content[:-1]) self.assertRaises(_mod_index.BadIndexData, index.validate) def test_validate_empty(self): """Test validate empty.""" index = self.make_index() index.validate() def test_validate_no_refs_content(self): """Test validate no refs content.""" index = self.make_index(nodes=[((b"key",), b"value", ())]) index.validate() # XXX: external_references tests are duplicated in test_btree_index. We # probably should have per_graph_index tests... def test_external_references_no_refs(self): """Test external references no refs.""" index = self.make_index(ref_lists=0, nodes=[]) self.assertRaises(ValueError, index.external_references, 0) def test_external_references_no_results(self): """Test external references no results.""" index = self.make_index(ref_lists=1, nodes=[((b"key",), b"value", ([],))]) self.assertEqual(set(), index.external_references(0)) def test_external_references_missing_ref(self): """Test external references missing ref.""" missing_key = (b"missing",) index = self.make_index( ref_lists=1, nodes=[((b"key",), b"value", ([missing_key],))] ) self.assertEqual({missing_key}, index.external_references(0)) def test_external_references_multiple_ref_lists(self): """Test external references multiple ref lists.""" missing_key = (b"missing",) index = self.make_index( ref_lists=2, nodes=[((b"key",), b"value", ([], [missing_key]))] ) self.assertEqual(set(), index.external_references(0)) self.assertEqual({missing_key}, index.external_references(1)) def test_external_references_two_records(self): """Test external references two records.""" index = self.make_index( ref_lists=1, nodes=[ ((b"key-1",), b"value", ([(b"key-2",)],)), ((b"key-2",), b"value", ([],)), ], ) self.assertEqual(set(), index.external_references(0)) def test__find_ancestors(self): """Test find ancestors.""" key1 = (b"key-1",) key2 = (b"key-2",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([],)), ], ) 
parent_map = {} missing_keys = set() search_keys = index._find_ancestors([key1], 0, parent_map, missing_keys) self.assertEqual({key1: (key2,)}, parent_map) self.assertEqual(set(), missing_keys) self.assertEqual({key2}, search_keys) search_keys = index._find_ancestors(search_keys, 0, parent_map, missing_keys) self.assertEqual({key1: (key2,), key2: ()}, parent_map) self.assertEqual(set(), missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_w_missing(self): """Test find ancestors w missing.""" key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([],)), ], ) parent_map = {} missing_keys = set() search_keys = index._find_ancestors([key2, key3], 0, parent_map, missing_keys) self.assertEqual({key2: ()}, parent_map) self.assertEqual({key3}, missing_keys) self.assertEqual(set(), search_keys) def test__find_ancestors_dont_search_known(self): """Test find ancestors dont search known.""" key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) index = self.make_index( ref_lists=1, key_elements=1, nodes=[ (key1, b"value", ([key2],)), (key2, b"value", ([key3],)), (key3, b"value", ([],)), ], ) # We already know about key2, so we won't try to search for key3 parent_map = {key2: (key3,)} missing_keys = set() search_keys = index._find_ancestors([key1], 0, parent_map, missing_keys) self.assertEqual({key1: (key2,), key2: (key3,)}, parent_map) self.assertEqual(set(), missing_keys) self.assertEqual(set(), search_keys) def test_supports_unlimited_cache(self): """Test supports unlimited cache.""" builder = _mod_index.GraphIndexBuilder(0, key_elements=1) stream = builder.finish() trans = self.get_transport() size = trans.put_file("index", stream) # It doesn't matter what unlimited_cache does here, just that it can be # passed _mod_index.GraphIndex(trans, "index", size, unlimited_cache=True) class TestCombinedGraphIndex(TestCaseWithMemoryTransport): 
"""Tests for Combined Graph Index.""" def make_index(self, name, ref_lists=0, key_elements=1, nodes=None): """Make index.""" if nodes is None: nodes = [] builder = _mod_index.GraphIndexBuilder(ref_lists, key_elements=key_elements) for key, value, references in nodes: builder.add_node(key, value, references) stream = builder.finish() trans = self.get_transport() size = trans.put_file(name, stream) return _mod_index.GraphIndex(trans, name, size) def make_combined_index_with_missing(self, missing=None): """Create a CombinedGraphIndex which will have missing indexes. This creates a CGI which thinks it has 2 indexes, however they have been deleted. If CGI._reload_func() is called, then it will repopulate with a new index. :param missing: The underlying indexes to delete :return: (CombinedGraphIndex, reload_counter) """ if missing is None: missing = ["1", "2"] idx1 = self.make_index("1", nodes=[((b"1",), b"", ())]) idx2 = self.make_index("2", nodes=[((b"2",), b"", ())]) idx3 = self.make_index("3", nodes=[((b"1",), b"", ()), ((b"2",), b"", ())]) # total_reloads, num_changed, num_unchanged reload_counter = [0, 0, 0] def reload(): reload_counter[0] += 1 new_indices = [idx3] if idx._indices == new_indices: reload_counter[2] += 1 return False reload_counter[1] += 1 idx._indices[:] = new_indices return True idx = _mod_index.CombinedGraphIndex([idx1, idx2], reload_func=reload) trans = self.get_transport() for fname in missing: trans.delete(fname) return idx, reload_counter def test_open_missing_index_no_error(self): """Test open missing index no error.""" trans = self.get_transport() idx1 = _mod_index.GraphIndex(trans, "missing", 100) _mod_index.CombinedGraphIndex([idx1]) def test_add_index(self): """Test add index.""" idx = _mod_index.CombinedGraphIndex([]) idx1 = self.make_index("name", 0, nodes=[((b"key",), b"", ())]) idx.insert_index(0, idx1) self.assertEqual([(idx1, (b"key",), b"")], list(idx.iter_all_entries())) def test_clear_cache(self): """Test clear cache.""" log = [] 
class ClearCacheProxy: def __init__(self, index): self._index = index def __getattr__(self, name): return getattr(self._index, name) def clear_cache(self): log.append(self._index) return self._index.clear_cache() idx = _mod_index.CombinedGraphIndex([]) idx1 = self.make_index("name", 0, nodes=[((b"key",), b"", ())]) idx.insert_index(0, ClearCacheProxy(idx1)) idx2 = self.make_index("name", 0, nodes=[((b"key",), b"", ())]) idx.insert_index(1, ClearCacheProxy(idx2)) # CombinedGraphIndex should call 'clear_cache()' on all children idx.clear_cache() self.assertEqual(sorted([idx1, idx2]), sorted(log)) def test_iter_all_entries_empty(self): """Test iter all entries empty.""" idx = _mod_index.CombinedGraphIndex([]) self.assertEqual([], list(idx.iter_all_entries())) def test_iter_all_entries_children_empty(self): """Test iter all entries children empty.""" idx1 = self.make_index("name") idx = _mod_index.CombinedGraphIndex([idx1]) self.assertEqual([], list(idx.iter_all_entries())) def test_iter_all_entries_simple(self): """Test iter all entries simple.""" idx1 = self.make_index("name", nodes=[((b"name",), b"data", ())]) idx = _mod_index.CombinedGraphIndex([idx1]) self.assertEqual([(idx1, (b"name",), b"data")], list(idx.iter_all_entries())) def test_iter_all_entries_two_indices(self): """Test iter all entries two indices.""" idx1 = self.make_index("name1", nodes=[((b"name",), b"data", ())]) idx2 = self.make_index("name2", nodes=[((b"2",), b"", ())]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual( [(idx1, (b"name",), b"data"), (idx2, (b"2",), b"")], list(idx.iter_all_entries()), ) def test_iter_entries_two_indices_dup_key(self): """Test iter entries two indices dup key.""" idx1 = self.make_index("name1", nodes=[((b"name",), b"data", ())]) idx2 = self.make_index("name2", nodes=[((b"name",), b"data", ())]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual( [(idx1, (b"name",), b"data")], list(idx.iter_entries([(b"name",)])) ) def
test_iter_all_entries_two_indices_dup_key(self): """Test iter all entries two indices dup key.""" idx1 = self.make_index("name1", nodes=[((b"name",), b"data", ())]) idx2 = self.make_index("name2", nodes=[((b"name",), b"data", ())]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual([(idx1, (b"name",), b"data")], list(idx.iter_all_entries())) def test_iter_key_prefix_2_key_element_refs(self): """Test iter key prefix 2 key element refs.""" idx1 = self.make_index( "1", 1, key_elements=2, nodes=[((b"name", b"fin1"), b"data", ([(b"ref", b"erence")],))], ) idx2 = self.make_index( "2", 1, key_elements=2, nodes=[ ((b"name", b"fin2"), b"beta", ([],)), ((b"ref", b"erence"), b"refdata", ([],)), ], ) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual( { (idx1, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (idx2, (b"ref", b"erence"), b"refdata", ((),)), }, set(idx.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (idx1, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (idx2, (b"name", b"fin2"), b"beta", ((),)), }, set(idx.iter_entries_prefix([(b"name", None)])), ) def test_iter_nothing_empty(self): """Test iter nothing empty.""" idx = _mod_index.CombinedGraphIndex([]) self.assertEqual([], list(idx.iter_entries([]))) def test_iter_nothing_children_empty(self): """Test iter nothing children empty.""" idx1 = self.make_index("name") idx = _mod_index.CombinedGraphIndex([idx1]) self.assertEqual([], list(idx.iter_entries([]))) def test_iter_all_keys(self): """Test iter all keys.""" idx1 = self.make_index("1", 1, nodes=[((b"name",), b"data", ([(b"ref",)],))]) idx2 = self.make_index("2", 1, nodes=[((b"ref",), b"refdata", ((),))]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual( { (idx1, (b"name",), b"data", (((b"ref",),),)), (idx2, (b"ref",), b"refdata", ((),)), }, set(idx.iter_entries([(b"name",), (b"ref",)])), ) def test_iter_all_keys_dup_entry(self): """Test iter all keys dup 
entry.""" idx1 = self.make_index( "1", 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) idx2 = self.make_index("2", 1, nodes=[((b"ref",), b"refdata", ([],))]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual( { (idx1, (b"name",), b"data", (((b"ref",),),)), (idx1, (b"ref",), b"refdata", ((),)), }, set(idx.iter_entries([(b"name",), (b"ref",)])), ) def test_iter_missing_entry_empty(self): """Test iter missing entry empty.""" idx = _mod_index.CombinedGraphIndex([]) self.assertEqual([], list(idx.iter_entries([("a",)]))) def test_iter_missing_entry_one_index(self): """Test iter missing entry one index.""" idx1 = self.make_index("1") idx = _mod_index.CombinedGraphIndex([idx1]) self.assertEqual([], list(idx.iter_entries([(b"a",)]))) def test_iter_missing_entry_two_index(self): """Test iter missing entry two index.""" idx1 = self.make_index("1") idx2 = self.make_index("2") idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual([], list(idx.iter_entries([("a",)]))) def test_iter_entry_present_one_index_only(self): """Test iter entry present one index only.""" idx1 = self.make_index("1", nodes=[((b"key",), b"", ())]) idx2 = self.make_index("2", nodes=[]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual([(idx1, (b"key",), b"")], list(idx.iter_entries([(b"key",)]))) # and in the other direction idx = _mod_index.CombinedGraphIndex([idx2, idx1]) self.assertEqual([(idx1, (b"key",), b"")], list(idx.iter_entries([(b"key",)]))) def test_key_count_empty(self): """Test key count empty.""" idx1 = self.make_index("1", nodes=[]) idx2 = self.make_index("2", nodes=[]) idx = _mod_index.CombinedGraphIndex([idx1, idx2]) self.assertEqual(0, idx.key_count()) def test_key_count_sums_index_keys(self): """Test key count sums index keys.""" idx1 = self.make_index("1", nodes=[((b"1",), b"", ()), ((b"2",), b"", ())]) idx2 = self.make_index("2", nodes=[((b"1",), b"", ())]) idx = _mod_index.CombinedGraphIndex([idx1, 
idx2]) self.assertEqual(3, idx.key_count()) def test_validate_bad_child_index_errors(self): """Test validate bad child index errors.""" trans = self.get_transport() trans.put_bytes("name", b"not an index\n") idx1 = _mod_index.GraphIndex(trans, "name", 13) idx = _mod_index.CombinedGraphIndex([idx1]) self.assertRaises(_mod_index.BadIndexFormatSignature, idx.validate) def test_validate_empty(self): """Test validate empty.""" idx = _mod_index.CombinedGraphIndex([]) idx.validate() def test_key_count_reloads(self): """Test key count reloads.""" idx, reload_counter = self.make_combined_index_with_missing() self.assertEqual(2, idx.key_count()) self.assertEqual([1, 1, 0], reload_counter) def test_key_count_no_reload(self): """Test key count no reload.""" idx, _reload_counter = self.make_combined_index_with_missing() idx._reload_func = None # Without a _reload_func we just raise the exception self.assertRaises(TransportNoSuchFile, idx.key_count) def test_key_count_reloads_and_fails(self): """Test key count reloads and fails.""" # We have deleted all underlying indexes, so we will try to reload, but # still fail. 
This is mostly to test we don't get stuck in an infinite # loop trying to reload idx, reload_counter = self.make_combined_index_with_missing(["1", "2", "3"]) self.assertRaises(TransportNoSuchFile, idx.key_count) self.assertEqual([2, 1, 1], reload_counter) def test_iter_entries_reloads(self): """Test iter entries reloads.""" index, reload_counter = self.make_combined_index_with_missing() result = list(index.iter_entries([(b"1",), (b"2",), (b"3",)])) index3 = index._indices[0] self.assertEqual({(index3, (b"1",), b""), (index3, (b"2",), b"")}, set(result)) self.assertEqual([1, 1, 0], reload_counter) def test_iter_entries_reloads_midway(self): """Test iter entries reloads midway.""" # The first index still looks present, so we get interrupted mid-way # through index, reload_counter = self.make_combined_index_with_missing(["2"]) index1, _index2 = index._indices result = list(index.iter_entries([(b"1",), (b"2",), (b"3",)])) index3 = index._indices[0] # We had already yielded b'1', so we just go on to the next, we should # not yield b'1' twice. 
self.assertEqual([(index1, (b"1",), b""), (index3, (b"2",), b"")], result) self.assertEqual([1, 1, 0], reload_counter) def test_iter_entries_no_reload(self): """Test iter entries no reload.""" index, _reload_counter = self.make_combined_index_with_missing() index._reload_func = None # Without a _reload_func we just raise the exception self.assertListRaises(TransportNoSuchFile, index.iter_entries, [("3",)]) def test_iter_entries_reloads_and_fails(self): """Test iter entries reloads and fails.""" index, reload_counter = self.make_combined_index_with_missing(["1", "2", "3"]) self.assertListRaises(TransportNoSuchFile, index.iter_entries, [("3",)]) self.assertEqual([2, 1, 1], reload_counter) def test_iter_all_entries_reloads(self): """Test iter all entries reloads.""" index, reload_counter = self.make_combined_index_with_missing() result = list(index.iter_all_entries()) index3 = index._indices[0] self.assertEqual({(index3, (b"1",), b""), (index3, (b"2",), b"")}, set(result)) self.assertEqual([1, 1, 0], reload_counter) def test_iter_all_entries_reloads_midway(self): """Test iter all entries reloads midway.""" index, reload_counter = self.make_combined_index_with_missing(["2"]) index1, _index2 = index._indices result = list(index.iter_all_entries()) index3 = index._indices[0] # We had already yielded '1', so we just go on to the next, we should # not yield '1' twice. 
self.assertEqual([(index1, (b"1",), b""), (index3, (b"2",), b"")], result) self.assertEqual([1, 1, 0], reload_counter) def test_iter_all_entries_no_reload(self): """Test iter all entries no reload.""" index, _reload_counter = self.make_combined_index_with_missing() index._reload_func = None self.assertListRaises(TransportNoSuchFile, index.iter_all_entries) def test_iter_all_entries_reloads_and_fails(self): """Test iter all entries reloads and fails.""" index, _reload_counter = self.make_combined_index_with_missing(["1", "2", "3"]) self.assertListRaises(TransportNoSuchFile, index.iter_all_entries) def test_iter_entries_prefix_reloads(self): """Test iter entries prefix reloads.""" index, reload_counter = self.make_combined_index_with_missing() result = list(index.iter_entries_prefix([(b"1",)])) index3 = index._indices[0] self.assertEqual([(index3, (b"1",), b"")], result) self.assertEqual([1, 1, 0], reload_counter) def test_iter_entries_prefix_reloads_midway(self): """Test iter entries prefix reloads midway.""" index, reload_counter = self.make_combined_index_with_missing(["2"]) index1, _index2 = index._indices result = list(index.iter_entries_prefix([(b"1",)])) index._indices[0] # We had already yielded b'1', so we just go on to the next, we should # not yield b'1' twice. 
self.assertEqual([(index1, (b"1",), b"")], result) self.assertEqual([1, 1, 0], reload_counter) def test_iter_entries_prefix_no_reload(self): """Test iter entries prefix no reload.""" index, _reload_counter = self.make_combined_index_with_missing() index._reload_func = None self.assertListRaises(TransportNoSuchFile, index.iter_entries_prefix, [(b"1",)]) def test_iter_entries_prefix_reloads_and_fails(self): """Test iter entries prefix reloads and fails.""" index, _reload_counter = self.make_combined_index_with_missing(["1", "2", "3"]) self.assertListRaises(TransportNoSuchFile, index.iter_entries_prefix, [(b"1",)]) def make_index_with_simple_nodes(self, name, num_nodes=1): """Make an index named after 'name', with keys named after 'name' too. Nodes will have a value of '' and no references. """ nodes = [ ((f"index-{name}-key-{n}".encode("ascii"),), b"", ()) for n in range(1, num_nodes + 1) ] return self.make_index(f"index-{name}", 0, nodes=nodes) def test_reorder_after_iter_entries(self): """Test reorder after iter entries.""" # Four indices: [key1] in idx1, [key2,key3] in idx2, [] in idx3, # [key4] in idx4. idx = _mod_index.CombinedGraphIndex([]) idx.insert_index(0, self.make_index_with_simple_nodes("1"), b"1") idx.insert_index(1, self.make_index_with_simple_nodes("2"), b"2") idx.insert_index(2, self.make_index_with_simple_nodes("3"), b"3") idx.insert_index(3, self.make_index_with_simple_nodes("4"), b"4") idx1, idx2, idx3, idx4 = idx._indices # Query a key from idx4 and idx2. self.assertLength( 2, list(idx.iter_entries([(b"index-4-key-1",), (b"index-2-key-1",)])) ) # Now idx2 and idx4 should be moved to the front (and idx1 should # still be before idx3). self.assertEqual([idx2, idx4, idx1, idx3], idx._indices) self.assertEqual([b"2", b"4", b"1", b"3"], idx._index_names) def test_reorder_propagates_to_siblings(self): """Test reorder propagates to siblings.""" # Two CombinedGraphIndex objects, with the same number of indices with # matching names.
cgi1 = _mod_index.CombinedGraphIndex([]) cgi2 = _mod_index.CombinedGraphIndex([]) cgi1.insert_index(0, self.make_index_with_simple_nodes("1-1"), "one") cgi1.insert_index(1, self.make_index_with_simple_nodes("1-2"), "two") cgi2.insert_index(0, self.make_index_with_simple_nodes("2-1"), "one") cgi2.insert_index(1, self.make_index_with_simple_nodes("2-2"), "two") index2_1, index2_2 = cgi2._indices cgi1.set_sibling_indices([cgi2]) # Trigger a reordering in cgi1. cgi2 will be reordered as well. list(cgi1.iter_entries([(b"index-1-2-key-1",)])) self.assertEqual([index2_2, index2_1], cgi2._indices) self.assertEqual(["two", "one"], cgi2._index_names) def test_validate_reloads(self): """Test validate reloads.""" idx, reload_counter = self.make_combined_index_with_missing() idx.validate() self.assertEqual([1, 1, 0], reload_counter) def test_validate_reloads_midway(self): """Test validate reloads midway.""" idx, _reload_counter = self.make_combined_index_with_missing(["2"]) idx.validate() def test_validate_no_reload(self): """Test validate no reload.""" idx, _reload_counter = self.make_combined_index_with_missing() idx._reload_func = None self.assertRaises(TransportNoSuchFile, idx.validate) def test_validate_reloads_and_fails(self): """Test validate reloads and fails.""" idx, _reload_counter = self.make_combined_index_with_missing(["1", "2", "3"]) self.assertRaises(TransportNoSuchFile, idx.validate) def test_find_ancestors_across_indexes(self): """Test find ancestors across indexes.""" key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) key4 = (b"key-4",) index1 = self.make_index( "12", ref_lists=1, nodes=[ (key1, b"value", ([],)), (key2, b"value", ([key1],)), ], ) index2 = self.make_index( "34", ref_lists=1, nodes=[ (key3, b"value", ([key2],)), (key4, b"value", ([key3],)), ], ) c_index = _mod_index.CombinedGraphIndex([index1, index2]) parent_map, missing_keys = c_index.find_ancestry([key1], 0) self.assertEqual({key1: ()}, parent_map) self.assertEqual(set(), missing_keys) 
# Now look for a key from index2 which requires us to find the key in # the second index, and then continue searching for parents in the # first index parent_map, missing_keys = c_index.find_ancestry([key3], 0) self.assertEqual({key1: (), key2: (key1,), key3: (key2,)}, parent_map) self.assertEqual(set(), missing_keys) def test_find_ancestors_missing_keys(self): """Test find ancestors missing keys.""" key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) key4 = (b"key-4",) index1 = self.make_index( "12", ref_lists=1, nodes=[ (key1, b"value", ([],)), (key2, b"value", ([key1],)), ], ) index2 = self.make_index( "34", ref_lists=1, nodes=[ (key3, b"value", ([key2],)), ], ) c_index = _mod_index.CombinedGraphIndex([index1, index2]) # Searching for a key which is actually not present at all should # eventually converge parent_map, missing_keys = c_index.find_ancestry([key4], 0) self.assertEqual({}, parent_map) self.assertEqual({key4}, missing_keys) def test_find_ancestors_no_indexes(self): """Test find ancestors no indexes.""" c_index = _mod_index.CombinedGraphIndex([]) key1 = (b"key-1",) parent_map, missing_keys = c_index.find_ancestry([key1], 0) self.assertEqual({}, parent_map) self.assertEqual({key1}, missing_keys) def test_find_ancestors_ghost_parent(self): """Test find ancestors ghost parent.""" key1 = (b"key-1",) key2 = (b"key-2",) key3 = (b"key-3",) key4 = (b"key-4",) index1 = self.make_index( "12", ref_lists=1, nodes=[ (key1, b"value", ([],)), (key2, b"value", ([key1],)), ], ) index2 = self.make_index( "34", ref_lists=1, nodes=[ (key4, b"value", ([key2, key3],)), ], ) c_index = _mod_index.CombinedGraphIndex([index1, index2]) # key4 is present, but one of its parents (key3) is a ghost; the search # should still converge and report key3 as missing parent_map, missing_keys = c_index.find_ancestry([key4], 0) self.assertEqual({key4: (key2, key3), key2: (key1,), key1: ()}, parent_map) self.assertEqual({key3}, missing_keys) def test__find_ancestors_empty_index(self): """Test find ancestors empty
index.""" idx = self.make_index("test", ref_lists=1, key_elements=1, nodes=[]) parent_map = {} missing_keys = set() search_keys = idx._find_ancestors( [(b"one",), (b"two",)], 0, parent_map, missing_keys ) self.assertEqual(set(), search_keys) self.assertEqual({}, parent_map) self.assertEqual({(b"one",), (b"two",)}, missing_keys) class TestInMemoryGraphIndex(TestCaseWithMemoryTransport): """Tests for In Memory Graph Index.""" def make_index(self, ref_lists=0, key_elements=1, nodes=None): """Make index.""" if nodes is None: nodes = [] result = _mod_index.InMemoryGraphIndex(ref_lists, key_elements=key_elements) result.add_nodes(nodes) return result def test_add_nodes_no_refs(self): """Test add nodes no refs.""" index = self.make_index(0) index.add_nodes([((b"name",), b"data")]) index.add_nodes([((b"name2",), b""), ((b"name3",), b"")]) self.assertEqual( { (index, (b"name",), b"data"), (index, (b"name2",), b""), (index, (b"name3",), b""), }, set(index.iter_all_entries()), ) def test_add_nodes(self): """Test add nodes.""" index = self.make_index(1) index.add_nodes([((b"name",), b"data", ([],))]) index.add_nodes([((b"name2",), b"", ([],)), ((b"name3",), b"", ([(b"r",)],))]) self.assertEqual( { (index, (b"name",), b"data", ((),)), (index, (b"name2",), b"", ((),)), (index, (b"name3",), b"", (((b"r",),),)), }, set(index.iter_all_entries()), ) def test_iter_all_entries_empty(self): """Test iter all entries empty.""" index = self.make_index() self.assertEqual([], list(index.iter_all_entries())) def test_iter_all_entries_simple(self): """Test iter all entries simple.""" index = self.make_index(nodes=[((b"name",), b"data")]) self.assertEqual([(index, (b"name",), b"data")], list(index.iter_all_entries())) def test_iter_all_entries_references(self): """Test iter all entries references.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, 
(b"ref",), b"refdata", ((),)), }, set(index.iter_all_entries()), ) def test_iteration_absent_skipped(self): """Test iteration absent skipped.""" index = self.make_index(1, nodes=[((b"name",), b"data", ([(b"ref",)],))]) self.assertEqual( {(index, (b"name",), b"data", (((b"ref",),),))}, set(index.iter_all_entries()), ) self.assertEqual( {(index, (b"name",), b"data", (((b"ref",),),))}, set(index.iter_entries([(b"name",)])), ) self.assertEqual([], list(index.iter_entries([(b"ref",)]))) def test_iter_all_keys(self): """Test iter all keys.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_entries([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_1_key_element_no_refs(self): """Test iter key prefix 1 key element no refs.""" index = self.make_index(nodes=[((b"name",), b"data"), ((b"ref",), b"refdata")]) self.assertEqual( {(index, (b"name",), b"data"), (index, (b"ref",), b"refdata")}, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_1_key_element_refs(self): """Test iter key prefix 1 key element refs.""" index = self.make_index( 1, nodes=[ ((b"name",), b"data", ([(b"ref",)],)), ((b"ref",), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name",), b"data", (((b"ref",),),)), (index, (b"ref",), b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name",), (b"ref",)])), ) def test_iter_key_prefix_2_key_element_no_refs(self): """Test iter key prefix 2 key element no refs.""" index = self.make_index( key_elements=2, nodes=[ ((b"name", b"fin1"), b"data"), ((b"name", b"fin2"), b"beta"), ((b"ref", b"erence"), b"refdata"), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data"), (index, (b"ref", b"erence"), b"refdata"), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", 
b"fin1"), b"data"), (index, (b"name", b"fin2"), b"beta"), }, set(index.iter_entries_prefix([(b"name", None)])), ) def test_iter_key_prefix_2_key_element_refs(self): """Test iter key prefix 2 key element refs.""" index = self.make_index( 1, key_elements=2, nodes=[ ((b"name", b"fin1"), b"data", ([(b"ref", b"erence")],)), ((b"name", b"fin2"), b"beta", ([],)), ((b"ref", b"erence"), b"refdata", ([],)), ], ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"ref", b"erence"), b"refdata", ((),)), }, set(index.iter_entries_prefix([(b"name", b"fin1"), (b"ref", b"erence")])), ) self.assertEqual( { (index, (b"name", b"fin1"), b"data", (((b"ref", b"erence"),),)), (index, (b"name", b"fin2"), b"beta", ((),)), }, set(index.iter_entries_prefix([(b"name", None)])), ) def test_iter_nothing_empty(self): """Test iter nothing empty.""" index = self.make_index() self.assertEqual([], list(index.iter_entries([]))) def test_iter_missing_entry_empty(self): """Test iter missing entry empty.""" index = self.make_index() self.assertEqual([], list(index.iter_entries([b"a"]))) def test_key_count_empty(self): """Test key count empty.""" index = self.make_index() self.assertEqual(0, index.key_count()) def test_key_count_one(self): """Test key count one.""" index = self.make_index(nodes=[((b"name",), b"")]) self.assertEqual(1, index.key_count()) def test_key_count_two(self): """Test key count two.""" index = self.make_index(nodes=[((b"name",), b""), ((b"foo",), b"")]) self.assertEqual(2, index.key_count()) def test_validate_empty(self): """Test validate empty.""" index = self.make_index() index.validate() def test_validate_no_refs_content(self): """Test validate no refs content.""" index = self.make_index(nodes=[((b"key",), b"value")]) index.validate() class TestGraphIndexPrefixAdapter(TestCaseWithMemoryTransport): """Tests for Graph Index Prefix Adapter.""" def make_index(self, ref_lists=1, key_elements=2, nodes=None, add_callback=False): """Make 
index.""" if nodes is None: nodes = [] result = _mod_index.InMemoryGraphIndex(ref_lists, key_elements=key_elements) result.add_nodes(nodes) add_nodes_callback = result.add_nodes if add_callback else None adapter = _mod_index.GraphIndexPrefixAdapter( result, (b"prefix",), key_elements - 1, add_nodes_callback=add_nodes_callback, ) return result, adapter def test_add_node(self): """Test add node.""" index, adapter = self.make_index(add_callback=True) adapter.add_node((b"key",), b"value", (((b"ref",),),)) self.assertEqual( {(index, (b"prefix", b"key"), b"value", (((b"prefix", b"ref"),),))}, set(index.iter_all_entries()), ) def test_add_nodes(self): """Test add nodes.""" index, adapter = self.make_index(add_callback=True) adapter.add_nodes( ( ((b"key",), b"value", (((b"ref",),),)), ((b"key2",), b"value2", ((),)), ) ) self.assertEqual( { (index, (b"prefix", b"key2"), b"value2", ((),)), (index, (b"prefix", b"key"), b"value", (((b"prefix", b"ref"),),)), }, set(index.iter_all_entries()), ) def test_construct(self): """Test construct.""" idx = _mod_index.InMemoryGraphIndex() _mod_index.GraphIndexPrefixAdapter(idx, (b"prefix",), 1) def test_construct_with_callback(self): """Test construct with callback.""" idx = _mod_index.InMemoryGraphIndex() _mod_index.GraphIndexPrefixAdapter(idx, (b"prefix",), 1, idx.add_nodes) def test_iter_all_entries_cross_prefix_map_errors(self): """Test iter all entries cross prefix map errors.""" _index, adapter = self.make_index( nodes=[((b"prefix", b"key1"), b"data1", (((b"prefixaltered", b"key2"),),))] ) self.assertRaises(_mod_index.BadIndexData, list, adapter.iter_all_entries()) def test_iter_all_entries(self): """Test iter all entries.""" index, adapter = self.make_index( nodes=[ ((b"notprefix", b"key1"), b"data", ((),)), ((b"prefix", b"key1"), b"data1", ((),)), ((b"prefix", b"key2"), b"data2", (((b"prefix", b"key1"),),)), ] ) self.assertEqual( { (index, (b"key1",), b"data1", ((),)), (index, (b"key2",), b"data2", (((b"key1",),),)), }, 
set(adapter.iter_all_entries()), ) def test_iter_entries(self): """Test iter entries.""" index, adapter = self.make_index( nodes=[ ((b"notprefix", b"key1"), b"data", ((),)), ((b"prefix", b"key1"), b"data1", ((),)), ((b"prefix", b"key2"), b"data2", (((b"prefix", b"key1"),),)), ] ) # ask for many - get all self.assertEqual( { (index, (b"key1",), b"data1", ((),)), (index, (b"key2",), b"data2", (((b"key1",),),)), }, set(adapter.iter_entries([(b"key1",), (b"key2",)])), ) # ask for one, get one self.assertEqual( {(index, (b"key1",), b"data1", ((),))}, set(adapter.iter_entries([(b"key1",)])), ) # ask for missing, get none self.assertEqual(set(), set(adapter.iter_entries([(b"key3",)]))) def test_iter_entries_prefix(self): """Test iter entries prefix.""" index, adapter = self.make_index( key_elements=3, nodes=[ ((b"notprefix", b"foo", b"key1"), b"data", ((),)), ((b"prefix", b"prefix2", b"key1"), b"data1", ((),)), ( (b"prefix", b"prefix2", b"key2"), b"data2", (((b"prefix", b"prefix2", b"key1"),),), ), ], ) # ask for a prefix, get the results for just that prefix, adjusted. 
self.assertEqual( { ( index, ( b"prefix2", b"key1", ), b"data1", ((),), ), ( index, ( b"prefix2", b"key2", ), b"data2", ( ( ( b"prefix2", b"key1", ), ), ), ), }, set(adapter.iter_entries_prefix([(b"prefix2", None)])), ) def test_key_count_no_matching_keys(self): """Test key count no matching keys.""" _index, adapter = self.make_index( nodes=[((b"notprefix", b"key1"), b"data", ((),))] ) self.assertEqual(0, adapter.key_count()) def test_key_count_some_keys(self): """Test key count some keys.""" _index, adapter = self.make_index( nodes=[ ((b"notprefix", b"key1"), b"data", ((),)), ((b"prefix", b"key1"), b"data1", ((),)), ((b"prefix", b"key2"), b"data2", (((b"prefix", b"key1"),),)), ] ) self.assertEqual(2, adapter.key_count()) def test_validate(self): """Test validate.""" index, adapter = self.make_index() calls = [] def validate(): calls.append("called") index.validate = validate adapter.validate() self.assertEqual(["called"], calls) bzrformats_3.4.0.orig/bzrformats/tests/test_inv.py0000644000000000000000000013104715162115103017417 0ustar00# Copyright (C) 2005-2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA from bzrformats import chk_map, groupcompress, inventory, osutils from bzrformats import errors as bzrformats_errors from bzrformats.inventory import ( ROOT_ID, CHKInventory, DuplicateFileId, InvalidEntryName, Inventory, InventoryDirectory, InventoryEntry, InventoryFile, NoSuchId, TreeReference, _chk_inventory_bytes_to_entry, _chk_inventory_entry_to_bytes, chk_inventory_bytes_to_utf8name_key, ) from ..inventory_delta import InventoryDelta from . import TestCase, TestCaseWithMemoryTransport class TestInventoryUpdates(TestCase): def test_creation_from_root_id(self): # iff a root id is passed to the constructor, a root directory is made inv = inventory.Inventory(root_id=b"tree-root") self.assertNotEqual(None, inv.root) self.assertEqual(b"tree-root", inv.root.file_id) def test_add_path_of_root(self): # if no root id is given at creation time, there is no root directory inv = inventory.Inventory(root_id=None) self.assertIs(None, inv.root) # add a root entry by adding its path ie = inv.add_path("", "directory", b"my-root", revision=b"test-rev") self.assertEqual(b"my-root", ie.file_id) self.assertEqual(ie, inv.root) def test_add_path(self): inv = inventory.Inventory(root_id=b"tree_root") ie = inv.add_path("hello", "file", b"hello-id") self.assertEqual(b"hello-id", ie.file_id) self.assertEqual("file", ie.kind) def test_copy(self): """Make sure copy() works and creates a deep copy.""" inv = inventory.Inventory(root_id=b"some-tree-root") inv.add_path("hello", "file", b"hello-id") inv2 = inv.copy() inv.rename_id(b"some-tree-root", b"some-new-root") self.assertEqual(b"some-tree-root", inv2.root.file_id) self.assertEqual("hello", inv2.get_entry(b"hello-id").name) def test_copy_empty(self): """Make sure an empty inventory can be copied.""" inv = 
inventory.Inventory(root_id=None) inv2 = inv.copy() self.assertIs(None, inv2.root) def test_copy_copies_root_revision(self): """Make sure the revision of the root gets copied.""" inv = inventory.Inventory(root_id=b"someroot", root_revision=b"therev") inv2 = inv.copy() self.assertEqual(b"someroot", inv2.root.file_id) self.assertEqual(b"therev", inv2.root.revision) def test_create_tree_reference(self): inv = inventory.Inventory(b"tree-root-123") inv.add( TreeReference( b"nested-id", "nested", parent_id=b"tree-root-123", revision=b"rev", reference_revision=b"rev2", ) ) def test_error_encoding(self): inv = inventory.Inventory(b"tree-root") inv.add(InventoryFile(b"a-id", "\u1234", b"tree-root")) from bzrformats.errors import AlreadyVersionedError e = self.assertRaises( AlreadyVersionedError, inv.add, InventoryFile(b"b-id", "\u1234", b"tree-root"), ) self.assertContainsRe(str(e), "\\u1234") def test_add_recursive(self): parent = InventoryDirectory(b"src-id", "src", b"tree-root") child = InventoryFile(b"hello-id", "hello.c", b"src-id") inv = inventory.Inventory(b"tree-root") inv.add(parent) inv.add(child) self.assertEqual("src/hello.c", inv.id2path(b"hello-id")) class TestInventoryEntry(TestCase): def test_file_invalid_entry_name(self): self.assertRaises( InvalidEntryName, inventory.InventoryFile, b"123", "a/hello.c", ROOT_ID ) def test_file_backslash(self): file = inventory.InventoryFile(b"123", "h\\ello.c", ROOT_ID) self.assertEqual(file.name, "h\\ello.c") def test_file_kind_character(self): file = inventory.InventoryFile(b"123", "hello.c", ROOT_ID) self.assertEqual(file.kind_character(), "") def test_dir_kind_character(self): dir = inventory.InventoryDirectory(b"123", "hello.c", ROOT_ID) self.assertEqual(dir.kind_character(), "/") def test_link_kind_character(self): dir = inventory.InventoryLink(b"123", "hello.c", ROOT_ID) self.assertEqual(dir.kind_character(), "@") def test_tree_ref_kind_character(self): dir = TreeReference(b"123", "hello.c", ROOT_ID) 
self.assertEqual(dir.kind_character(), "+") def test_dir_detect_changes(self): left = inventory.InventoryDirectory(b"123", "hello.c", ROOT_ID) right = inventory.InventoryDirectory(b"123", "hello.c", ROOT_ID) self.assertEqual((False, False), left.detect_changes(right)) self.assertEqual((False, False), right.detect_changes(left)) def test_file_detect_changes(self): left = inventory.InventoryFile(b"123", "hello.c", ROOT_ID, text_sha1=b"123") right = inventory.InventoryFile(b"123", "hello.c", ROOT_ID, text_sha1=b"123") self.assertEqual((False, False), left.detect_changes(right)) self.assertEqual((False, False), right.detect_changes(left)) left = inventory.InventoryFile( b"123", "hello.c", ROOT_ID, text_sha1=b"123", executable=True ) self.assertEqual((False, True), left.detect_changes(right)) self.assertEqual((False, True), right.detect_changes(left)) right = inventory.InventoryFile(b"123", "hello.c", ROOT_ID, text_sha1=b"321") self.assertEqual((True, True), left.detect_changes(right)) self.assertEqual((True, True), right.detect_changes(left)) def test_symlink_detect_changes(self): left = inventory.InventoryLink(b"123", "hello.c", ROOT_ID, symlink_target="foo") right = inventory.InventoryLink( b"123", "hello.c", ROOT_ID, symlink_target="foo" ) self.assertEqual((False, False), left.detect_changes(right)) self.assertEqual((False, False), right.detect_changes(left)) left = inventory.InventoryLink( b"123", "hello.c", ROOT_ID, symlink_target="different" ) self.assertEqual((True, False), left.detect_changes(right)) self.assertEqual((True, False), right.detect_changes(left)) def test_file_has_text(self): file = inventory.InventoryFile(b"123", "hello.c", ROOT_ID) self.assertTrue(file.has_text()) def test_directory_has_text(self): dir = inventory.InventoryDirectory(b"123", "hello.c", ROOT_ID) self.assertFalse(dir.has_text()) def test_link_has_text(self): link = inventory.InventoryLink(b"123", "hello.c", ROOT_ID) self.assertFalse(link.has_text()) def test_make_entry(self): 
self.assertIsInstance( inventory.make_entry("file", "name", ROOT_ID), inventory.InventoryFile ) self.assertIsInstance( inventory.make_entry("symlink", "name", ROOT_ID), inventory.InventoryLink ) self.assertIsInstance( inventory.make_entry("directory", "name", ROOT_ID), inventory.InventoryDirectory, ) def test_make_entry_non_normalized(self): if osutils.normalizes_filenames(): entry = inventory.make_entry("file", "a\u030a", ROOT_ID) self.assertEqual("\xe5", entry.name) self.assertIsInstance(entry, inventory.InventoryFile) else: self.assertRaises( bzrformats_errors.InvalidNormalization, inventory.make_entry, "file", "a\u030a", ROOT_ID, ) class TestDescribeChanges(TestCase): def test_describe_change(self): # we need to test the following change combinations: # rename # reparent # modify # gone # added # renamed/reparented and modified # change kind (perhaps can't be done yet?) # also, merged in combination with all of these? old_a = InventoryFile( b"a-id", "a_file", ROOT_ID, text_sha1=b"123132", text_size=0 ) new_a = InventoryFile( b"a-id", "a_file", ROOT_ID, text_sha1=b"123132", text_size=0 ) self.assertChangeDescription("unchanged", old_a, new_a) new_a = InventoryFile( b"a-id", "a_file", ROOT_ID, text_sha1=b"abcabc", text_size=10 ) self.assertChangeDescription("modified", old_a, new_a) self.assertChangeDescription("added", None, new_a) self.assertChangeDescription("removed", old_a, None) # perhaps a bit questionable but seems like the most reasonable thing... 
        self.assertChangeDescription("unchanged", None, None)

        # in this case it's both renamed and modified; show a rename and
        # modification:
        new_a = InventoryFile(
            b"a-id", "newfilename", ROOT_ID, text_sha1=b"abcabc", text_size=10
        )
        self.assertChangeDescription("modified and renamed", old_a, new_a)

        # reparenting is 'renaming'
        new_a = InventoryFile(
            b"a-id", old_a.name, b"somedir-id", text_sha1=b"abcabc", text_size=10
        )
        self.assertChangeDescription("modified and renamed", old_a, new_a)

        # reset the content values so it's not modified
        new_a = InventoryFile(
            b"a-id",
            "newfilename",
            b"somedir-id",
            text_size=old_a.text_size,
            text_sha1=old_a.text_sha1,
        )
        self.assertChangeDescription("renamed", old_a, new_a)

        # reparenting is 'renaming'
        new_a = InventoryFile(
            b"a-id",
            old_a.name,
            b"somedir-id",
            text_size=old_a.text_size,
            text_sha1=old_a.text_sha1,
        )
        self.assertChangeDescription("renamed", old_a, new_a)

    def assertChangeDescription(self, expected_change, old_ie, new_ie):
        change = InventoryEntry.describe_change(old_ie, new_ie)
        self.assertEqual(expected_change, change)


class TestCHKInventory(TestCaseWithMemoryTransport):
    def get_chk_bytes(self):
        factory = groupcompress.make_pack_factory(True, True, 1)
        trans = self.get_transport("")
        return factory(trans)

    def read_bytes(self, chk_bytes, key):
        stream = chk_bytes.get_record_stream([key], "unordered", True)
        return next(stream).get_bytes_as("fulltext")

    def test_deserialise_gives_CHKInventory(self):
        inv = Inventory(revision_id=b"revid", root_revision=b"rootrev")
        chk_bytes = self.get_chk_bytes()
        chk_inv = CHKInventory.from_inventory(chk_bytes, inv)
        lines = chk_inv.to_lines()
        new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",))
        self.assertEqual(b"revid", new_inv.revision_id)
        self.assertEqual("directory", new_inv.root.kind)
        self.assertEqual(inv.root.file_id, new_inv.root.file_id)
        self.assertEqual(inv.root.parent_id, new_inv.root.parent_id)
        self.assertEqual(inv.root.name, new_inv.root.name)
        self.assertEqual(b"rootrev",
new_inv.root.revision) self.assertEqual(b"plain", new_inv._search_key_name) def test_deserialise_wrong_revid(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() self.assertRaises( ValueError, CHKInventory.deserialise, chk_bytes, lines, (b"revid2",) ) def test_captures_rev_root_byid(self): inv = Inventory(revision_id=b"foo", root_revision=b"bar") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() self.assertEqual( [ b"chkinventory:\n", b"revision_id: foo\n", b"root_id: TREE_ROOT\n", b"parent_id_basename_to_file_id: sha1:eb23f0ad4b07f48e88c76d4c94292be57fb2785f\n", b"id_to_entry: sha1:debfe920f1f10e7929260f0534ac9a24d7aabbb4\n", ], lines, ) chk_inv = CHKInventory.deserialise(chk_bytes, lines, (b"foo",)) self.assertEqual(b"plain", chk_inv._search_key_name) def test_captures_parent_id_basename_index(self): inv = Inventory(revision_id=b"foo", root_revision=b"bar") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() self.assertEqual( [ b"chkinventory:\n", b"revision_id: foo\n", b"root_id: TREE_ROOT\n", b"parent_id_basename_to_file_id: sha1:eb23f0ad4b07f48e88c76d4c94292be57fb2785f\n", b"id_to_entry: sha1:debfe920f1f10e7929260f0534ac9a24d7aabbb4\n", ], lines, ) chk_inv = CHKInventory.deserialise(chk_bytes, lines, (b"foo",)) self.assertEqual(b"plain", chk_inv._search_key_name) def test_captures_search_key_name(self): inv = Inventory(revision_id=b"foo", root_revision=b"bar") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory( chk_bytes, inv, search_key_name=b"hash-16-way" ) lines = chk_inv.to_lines() self.assertEqual( [ b"chkinventory:\n", b"search_key_name: hash-16-way\n", b"root_id: TREE_ROOT\n", b"parent_id_basename_to_file_id: sha1:eb23f0ad4b07f48e88c76d4c94292be57fb2785f\n", b"revision_id: foo\n", 
b"id_to_entry: sha1:debfe920f1f10e7929260f0534ac9a24d7aabbb4\n", ], lines, ) chk_inv = CHKInventory.deserialise(chk_bytes, lines, (b"foo",)) self.assertEqual(b"hash-16-way", chk_inv._search_key_name) def test_directory_children_on_demand(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) root_entry = new_inv.get_entry(inv.root.file_id) self.assertEqual({"file"}, set(inv.get_children(root_entry.file_id))) file_direct = new_inv.get_entry(b"fileid") file_found = inv.get_children(root_entry.file_id)["file"] self.assertEqual(file_direct.kind, file_found.kind) self.assertEqual(file_direct.file_id, file_found.file_id) self.assertEqual(file_direct.parent_id, file_found.parent_id) self.assertEqual(file_direct.name, file_found.name) self.assertEqual(file_direct.revision, file_found.revision) self.assertEqual(file_direct.text_sha1, file_found.text_sha1) self.assertEqual(file_direct.text_size, file_found.text_size) self.assertEqual(file_direct.executable, file_found.executable) def test_from_inventory_maximum_size(self): # from_inventory supports the maximum_size parameter. 
inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv, 120) chk_inv.id_to_entry._ensure_root() self.assertEqual(120, chk_inv.id_to_entry._root_node.maximum_size) self.assertEqual(1, chk_inv.id_to_entry._root_node._key_width) p_id_basename = chk_inv.parent_id_basename_to_file_id p_id_basename._ensure_root() self.assertEqual(120, p_id_basename._root_node.maximum_size) self.assertEqual(2, p_id_basename._root_node._key_width) def test_iter_all_ids(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) fileids = sorted(new_inv.iter_all_ids()) self.assertEqual([inv.root.file_id, b"fileid"], fileids) def test__len__(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) self.assertEqual(2, len(chk_inv)) def test_get_entry(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) root_entry = new_inv.get_entry(inv.root.file_id) file_entry = new_inv.get_entry(b"fileid") self.assertEqual("directory", root_entry.kind) self.assertEqual(inv.root.file_id, root_entry.file_id) 
self.assertEqual(inv.root.parent_id, root_entry.parent_id) self.assertEqual(inv.root.name, root_entry.name) self.assertEqual(b"rootrev", root_entry.revision) self.assertEqual("file", file_entry.kind) self.assertEqual(b"fileid", file_entry.file_id) self.assertEqual(inv.root.file_id, file_entry.parent_id) self.assertEqual("file", file_entry.name) self.assertEqual(b"filerev", file_entry.revision) self.assertEqual(b"ffff", file_entry.text_sha1) self.assertEqual(1, file_entry.text_size) self.assertEqual(True, file_entry.executable) self.assertRaises(NoSuchId, new_inv.get_entry, "missing") def test_has_id_true(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) self.assertTrue(chk_inv.has_id(b"fileid")) self.assertTrue(chk_inv.has_id(inv.root.file_id)) def test_has_id_not(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) self.assertFalse(chk_inv.has_id(b"fileid")) def test_id2path(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") direntry = InventoryDirectory( b"dirid", "dir", inv.root.file_id, revision=b"filerev" ) fileentry = InventoryFile( b"fileid", "file", b"dirid", revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) inv.add(direntry) inv.add(fileentry) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) self.assertEqual("", new_inv.id2path(inv.root.file_id)) self.assertEqual("dir", new_inv.id2path(b"dirid")) self.assertEqual("dir/file", new_inv.id2path(b"fileid")) def test_path2id(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") 
direntry = InventoryDirectory( b"dirid", "dir", inv.root.file_id, revision=b"filerev" ) fileentry = InventoryFile( b"fileid", "file", b"dirid", revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) inv.add(direntry) inv.add(fileentry) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) self.assertEqual(inv.root.file_id, new_inv.path2id("")) self.assertEqual(b"dirid", new_inv.path2id("dir")) self.assertEqual(b"fileid", new_inv.path2id("dir/file")) def test_create_by_apply_delta_sets_root(self): inv = Inventory(root_revision=b"myrootrev", revision_id=b"revid") chk_bytes = self.get_chk_bytes() base_inv = CHKInventory.from_inventory(chk_bytes, inv) inv.revision_id = b"expectedid" inv.add_path("", "directory", b"myrootid", revision=b"myrootrev") reference_inv = CHKInventory.from_inventory(chk_bytes, inv) delta = InventoryDelta( [("", None, base_inv.root.file_id, None), (None, "", b"myrootid", inv.root)] ) new_inv = base_inv.create_by_apply_delta(delta, b"expectedid") self.assertEqual(reference_inv.root, new_inv.root) def test_create_by_apply_delta_empty_add_child(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") chk_bytes = self.get_chk_bytes() base_inv = CHKInventory.from_inventory(chk_bytes, inv) a_entry = InventoryFile( b"A-id", "A", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) inv.add(a_entry) inv.revision_id = b"expectedid" reference_inv = CHKInventory.from_inventory(chk_bytes, inv) delta = InventoryDelta([(None, "A", b"A-id", a_entry)]) new_inv = base_inv.create_by_apply_delta(delta, b"expectedid") # new_inv should be the same as reference_inv. 
self.assertEqual(reference_inv.revision_id, new_inv.revision_id) self.assertEqual(reference_inv.root_id, new_inv.root_id) reference_inv.id_to_entry._ensure_root() new_inv.id_to_entry._ensure_root() self.assertEqual( reference_inv.id_to_entry._root_node._key, new_inv.id_to_entry._root_node._key, ) def test_create_by_apply_delta_empty_add_child_updates_parent_id(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") chk_bytes = self.get_chk_bytes() base_inv = CHKInventory.from_inventory(chk_bytes, inv) a_entry = InventoryFile( b"A-id", "A", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) inv.add(a_entry) inv.revision_id = b"expectedid" reference_inv = CHKInventory.from_inventory(chk_bytes, inv) delta = InventoryDelta([(None, "A", b"A-id", a_entry)]) new_inv = base_inv.create_by_apply_delta(delta, b"expectedid") reference_inv.id_to_entry._ensure_root() reference_inv.parent_id_basename_to_file_id._ensure_root() new_inv.id_to_entry._ensure_root() new_inv.parent_id_basename_to_file_id._ensure_root() # new_inv should be the same as reference_inv. self.assertEqual(reference_inv.revision_id, new_inv.revision_id) self.assertEqual(reference_inv.root_id, new_inv.root_id) self.assertEqual( reference_inv.id_to_entry._root_node._key, new_inv.id_to_entry._root_node._key, ) self.assertEqual( reference_inv.parent_id_basename_to_file_id._root_node._key, new_inv.parent_id_basename_to_file_id._root_node._key, ) def test_iter_changes(self): # Low level bootstrapping smoke test; comprehensive generic tests via # InterTree are coming. 
inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) inv2 = Inventory(revision_id=b"revid2", root_revision=b"rootrev") inv2.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev2", executable=False, text_sha1=b"bbbb", text_size=2, ) ) # get fresh objects. chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() inv_1 = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) chk_inv2 = CHKInventory.from_inventory(chk_bytes, inv2) lines = chk_inv2.to_lines() inv_2 = CHKInventory.deserialise(chk_bytes, lines, (b"revid2",)) self.assertEqual( [ ( b"fileid", ("file", "file"), True, (True, True), (b"TREE_ROOT", b"TREE_ROOT"), ("file", "file"), ("file", "file"), (False, True), ) ], list(inv_1.iter_changes(inv_2)), ) def test_parent_id_basename_to_file_id_index_enabled(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") inv.add( InventoryFile( b"fileid", "file", inv.root.file_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) # get fresh objects. 
        chk_bytes = self.get_chk_bytes()
        tmp_inv = CHKInventory.from_inventory(chk_bytes, inv)
        lines = tmp_inv.to_lines()
        chk_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",))
        self.assertIsInstance(chk_inv.parent_id_basename_to_file_id, chk_map.CHKMap)
        self.assertEqual(
            {(b"", b""): b"TREE_ROOT", (b"TREE_ROOT", b"file"): b"fileid"},
            dict(chk_inv.parent_id_basename_to_file_id.iteritems()),
        )

    def test_file_entry_to_bytes(self):
        CHKInventory(None)
        ie = inventory.InventoryFile(
            b"file-id",
            "filename",
            b"parent-id",
            executable=True,
            revision=b"file-rev-id",
            text_sha1=b"abcdefgh",
            text_size=100,
        )
        bytes = _chk_inventory_entry_to_bytes(ie)
        self.assertEqual(
            b"file: file-id\nparent-id\nfilename\nfile-rev-id\nabcdefgh\n100\nY",
            bytes,
        )
        ie2 = _chk_inventory_bytes_to_entry(bytes)
        self.assertEqual(ie, ie2)
        self.assertIsInstance(ie2.name, str)
        self.assertEqual(
            (b"filename", b"file-id", b"file-rev-id"),
            chk_inventory_bytes_to_utf8name_key(bytes),
        )

    def test_file2_entry_to_bytes(self):
        CHKInventory(None)
        # \u03a9 == 'omega'
        ie = inventory.InventoryFile(
            b"file-id",
            "\u03a9name",
            b"parent-id",
            executable=False,
            revision=b"file-rev-id",
            text_sha1=b"123456",
            text_size=25,
        )
        bytes = _chk_inventory_entry_to_bytes(ie)
        self.assertEqual(
            b"file: file-id\nparent-id\n\xce\xa9name\nfile-rev-id\n123456\n25\nN",
            bytes,
        )
        ie2 = _chk_inventory_bytes_to_entry(bytes)
        self.assertEqual(ie, ie2)
        self.assertIsInstance(ie2.name, str)
        self.assertEqual(
            (b"\xce\xa9name", b"file-id", b"file-rev-id"),
            chk_inventory_bytes_to_utf8name_key(bytes),
        )

    def test_dir_entry_to_bytes(self):
        CHKInventory(None)
        ie = inventory.InventoryDirectory(
            b"dir-id", "dirname", b"parent-id", revision=b"dir-rev-id"
        )
        bytes = _chk_inventory_entry_to_bytes(ie)
        self.assertEqual(b"dir: dir-id\nparent-id\ndirname\ndir-rev-id", bytes)
        ie2 = _chk_inventory_bytes_to_entry(bytes)
        self.assertEqual(ie, ie2)
        self.assertIsInstance(ie2.name, str)
        self.assertEqual(
            (b"dirname", b"dir-id", b"dir-rev-id"),
chk_inventory_bytes_to_utf8name_key(bytes), ) def test_dir2_entry_to_bytes(self): CHKInventory(None) ie = inventory.InventoryDirectory( b"dir-id", "dir\u03a9name", b"pid", revision=b"dir-rev-id" ) bytes = _chk_inventory_entry_to_bytes(ie) self.assertEqual(b"dir: dir-id\npid\ndir\xce\xa9name\ndir-rev-id", bytes) ie2 = _chk_inventory_bytes_to_entry(bytes) self.assertEqual(ie, ie2) self.assertIsInstance(ie2.name, str) self.assertEqual(b"pid", ie2.parent_id) self.assertEqual( (b"dir\xce\xa9name", b"dir-id", b"dir-rev-id"), chk_inventory_bytes_to_utf8name_key(bytes), ) def test_symlink_entry_to_bytes(self): CHKInventory(None) ie = inventory.InventoryLink( b"link-id", "linkname", b"parent-id", revision=b"link-rev-id", symlink_target="target/path", ) bytes = _chk_inventory_entry_to_bytes(ie) self.assertEqual( b"symlink: link-id\nparent-id\nlinkname\nlink-rev-id\ntarget/path", bytes, ) ie2 = _chk_inventory_bytes_to_entry(bytes) self.assertEqual(ie, ie2) self.assertIsInstance(ie2.name, str) self.assertIsInstance(ie2.symlink_target, str) self.assertEqual( (b"linkname", b"link-id", b"link-rev-id"), chk_inventory_bytes_to_utf8name_key(bytes), ) def test_symlink2_entry_to_bytes(self): CHKInventory(None) ie = inventory.InventoryLink( b"link-id", "link\u03a9name", b"parent-id", revision=b"link-rev-id", symlink_target="target/\u03a9path", ) bytes = _chk_inventory_entry_to_bytes(ie) self.assertEqual( b"symlink: link-id\nparent-id\nlink\xce\xa9name\n" b"link-rev-id\ntarget/\xce\xa9path", bytes, ) ie2 = _chk_inventory_bytes_to_entry(bytes) self.assertEqual(ie, ie2) self.assertIsInstance(ie2.name, str) self.assertIsInstance(ie2.symlink_target, str) self.assertEqual( (b"link\xce\xa9name", b"link-id", b"link-rev-id"), chk_inventory_bytes_to_utf8name_key(bytes), ) def test_tree_reference_entry_to_bytes(self): CHKInventory(None) ie = inventory.TreeReference( b"tree-root-id", "tree\u03a9name", b"parent-id", revision=b"tree-rev-id", reference_revision=b"ref-rev-id", ) bytes = 
_chk_inventory_entry_to_bytes(ie) self.assertEqual( b"tree: tree-root-id\nparent-id\ntree\xce\xa9name\ntree-rev-id\nref-rev-id", bytes, ) ie2 = _chk_inventory_bytes_to_entry(bytes) self.assertEqual(ie, ie2) self.assertIsInstance(ie2.name, str) self.assertEqual( (b"tree\xce\xa9name", b"tree-root-id", b"tree-rev-id"), chk_inventory_bytes_to_utf8name_key(bytes), ) def make_basic_utf8_inventory(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") root_id = inv.root.file_id inv.add( InventoryFile( b"fileid", "f\xefle", root_id, revision=b"filerev", text_sha1=b"ffff", text_size=0, ) ) inv.add( InventoryDirectory( b"dirid", "dir-\N{EURO SIGN}", root_id, revision=b"dirrev" ) ) inv.add( InventoryFile( b"childid", "ch\xefld", b"dirid", revision=b"filerev", text_sha1=b"ffff", text_size=0, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() return CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) def test__preload_handles_utf8(self): new_inv = self.make_basic_utf8_inventory() self.assertEqual({}, new_inv._fileid_to_entry_cache) self.assertFalse(new_inv._fully_cached) new_inv._preload_cache() self.assertEqual( sorted([new_inv.root_id, b"fileid", b"dirid", b"childid"]), sorted(new_inv._fileid_to_entry_cache.keys()), ) ie_root = new_inv._fileid_to_entry_cache[new_inv.root_id] self.assertEqual( ["dir-\N{EURO SIGN}", "f\xefle"], [ie.name for ie in new_inv.iter_sorted_children(ie_root.file_id)], ) ie_dir = new_inv._fileid_to_entry_cache[b"dirid"] self.assertEqual( ["ch\xefld"], [ie.name for ie in new_inv.iter_sorted_children(ie_dir.file_id)], ) def test__preload_populates_cache(self): inv = Inventory(revision_id=b"revid", root_revision=b"rootrev") root_id = inv.root.file_id inv.add( InventoryFile( b"fileid", "file", root_id, revision=b"filerev", executable=True, text_sha1=b"ffff", text_size=1, ) ) inv.add(InventoryDirectory(b"dirid", "dir", root_id, revision=b"dirrev")) inv.add( InventoryFile( 
b"childid", "child", b"dirid", revision=b"filerev", executable=False, text_sha1=b"dddd", text_size=1, ) ) chk_bytes = self.get_chk_bytes() chk_inv = CHKInventory.from_inventory(chk_bytes, inv) lines = chk_inv.to_lines() new_inv = CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) self.assertEqual({}, new_inv._fileid_to_entry_cache) self.assertFalse(new_inv._fully_cached) new_inv._preload_cache() self.assertEqual( sorted([root_id, b"fileid", b"dirid", b"childid"]), sorted(new_inv._fileid_to_entry_cache.keys()), ) self.assertTrue(new_inv._fully_cached) ie_root = new_inv._fileid_to_entry_cache[root_id] self.assertEqual( ["dir", "file"], [ie.name for ie in new_inv.iter_sorted_children(ie_root.file_id)], ) ie_dir = new_inv._fileid_to_entry_cache[b"dirid"] self.assertEqual( ["child"], [ie.name for ie in new_inv.iter_sorted_children(ie_dir.file_id)] ) def test__preload_handles_partially_evaluated_inventory(self): new_inv = self.make_basic_utf8_inventory() ie = new_inv.get_entry(new_inv.root_id) self.assertEqual( ["dir-\N{EURO SIGN}", "f\xefle"], [c.name for c in new_inv.iter_sorted_children(ie.file_id)], ) new_inv._preload_cache() # No change self.assertEqual( ["dir-\N{EURO SIGN}", "f\xefle"], [c.name for c in new_inv.iter_sorted_children(ie.file_id)], ) self.assertEqual( ["ch\xefld"], [c.name for c in new_inv.iter_sorted_children(b"dirid")] ) def test_filter_change_in_renamed_subfolder(self): inv = Inventory(b"tree-root", root_revision=b"rootrev") src_ie = inv.add_path("src", "directory", b"src-id", revision=b"srcrev") inv.add_path("src/sub/", "directory", b"sub-id", revision=b"subrev") a_ie = inv.add_path( "src/sub/a", "file", b"a-id", revision=b"filerev", text_sha1=osutils.sha_string(b"content\n"), text_size=len(b"content\n"), ) chk_bytes = self.get_chk_bytes() inv = CHKInventory.from_inventory(chk_bytes, inv) inv = inv.create_by_apply_delta( InventoryDelta( [ ("src/sub/a", "src/sub/a", b"a-id", a_ie), ("src", "src2", b"src-id", src_ie), ] ), b"new-rev-2", ) 
new_inv = inv.filter([b"a-id", b"src-id"]) self.assertEqual( [ ("", b"tree-root"), ("src", b"src-id"), ("src/sub", b"sub-id"), ("src/sub/a", b"a-id"), ], [(path, ie.file_id) for path, ie in new_inv.iter_entries()], ) class TestCHKInventoryExpand(TestCaseWithMemoryTransport): def get_chk_bytes(self): factory = groupcompress.make_pack_factory(True, True, 1) trans = self.get_transport("") return factory(trans) def make_dir(self, inv, name, parent_id, revision): ie = inv.make_entry( "directory", name, parent_id, name.encode("utf-8") + b"-id", revision=revision, ) inv.add(ie) def make_file(self, inv, name, parent_id, revision, content=b"content\n"): ie = inv.make_entry( "file", name, parent_id, name.encode("utf-8") + b"-id", text_sha1=osutils.sha_string(content), text_size=len(content), revision=revision, ) inv.add(ie) def make_simple_inventory(self): inv = Inventory(b"TREE_ROOT", revision_id=b"revid", root_revision=b"rootrev") # / TREE_ROOT # dir1/ dir1-id # sub-file1 sub-file1-id # sub-file2 sub-file2-id # sub-dir1/ sub-dir1-id # subsub-file1 subsub-file1-id # dir2/ dir2-id # sub2-file1 sub2-file1-id # top top-id self.make_dir(inv, "dir1", b"TREE_ROOT", b"dirrev") self.make_dir(inv, "dir2", b"TREE_ROOT", b"dirrev") self.make_dir(inv, "sub-dir1", b"dir1-id", b"dirrev") self.make_file(inv, "top", b"TREE_ROOT", b"filerev") self.make_file(inv, "sub-file1", b"dir1-id", b"filerev") self.make_file(inv, "sub-file2", b"dir1-id", b"filerev") self.make_file(inv, "subsub-file1", b"sub-dir1-id", b"filerev") self.make_file(inv, "sub2-file1", b"dir2-id", b"filerev") chk_bytes = self.get_chk_bytes() # use a small maximum_size to force internal paging structures chk_inv = CHKInventory.from_inventory( chk_bytes, inv, maximum_size=100, search_key_name=b"hash-255-way" ) lines = chk_inv.to_lines() return CHKInventory.deserialise(chk_bytes, lines, (b"revid",)) def assert_Getitems(self, expected_fileids, inv, file_ids): self.assertEqual( sorted(expected_fileids), sorted([ie.file_id for ie 
in inv._getitems(file_ids)]), ) def assertExpand(self, all_ids, inv, file_ids): (val_all_ids, val_children) = inv._expand_fileids_to_parents_and_children( file_ids ) self.assertEqual(set(all_ids), val_all_ids) entries = inv._getitems(val_all_ids) expected_children = {} for entry in entries: s = expected_children.setdefault(entry.parent_id, []) s.append(entry.file_id) val_children = {k: sorted(v) for k, v in val_children.items()} expected_children = {k: sorted(v) for k, v in expected_children.items()} self.assertEqual(expected_children, val_children) def test_make_simple_inventory(self): inv = self.make_simple_inventory() layout = [] for path, entry in inv.iter_entries_by_dir(): layout.append((path, entry.file_id)) self.assertEqual( [ ("", b"TREE_ROOT"), ("dir1", b"dir1-id"), ("dir2", b"dir2-id"), ("top", b"top-id"), ("dir1/sub-dir1", b"sub-dir1-id"), ("dir1/sub-file1", b"sub-file1-id"), ("dir1/sub-file2", b"sub-file2-id"), ("dir1/sub-dir1/subsub-file1", b"subsub-file1-id"), ("dir2/sub2-file1", b"sub2-file1-id"), ], layout, ) def test__getitems(self): inv = self.make_simple_inventory() # Reading from disk self.assert_Getitems([b"dir1-id"], inv, [b"dir1-id"]) self.assertIn(b"dir1-id", inv._fileid_to_entry_cache) self.assertNotIn(b"sub-file2-id", inv._fileid_to_entry_cache) # From cache self.assert_Getitems([b"dir1-id"], inv, [b"dir1-id"]) # Mixed self.assert_Getitems( [b"dir1-id", b"sub-file2-id"], inv, [b"dir1-id", b"sub-file2-id"] ) self.assertIn(b"dir1-id", inv._fileid_to_entry_cache) self.assertIn(b"sub-file2-id", inv._fileid_to_entry_cache) def test_single_file(self): inv = self.make_simple_inventory() self.assertExpand([b"TREE_ROOT", b"top-id"], inv, [b"top-id"]) def test_get_all_parents(self): inv = self.make_simple_inventory() self.assertExpand( [ b"TREE_ROOT", b"dir1-id", b"sub-dir1-id", b"subsub-file1-id", ], inv, [b"subsub-file1-id"], ) def test_get_children(self): inv = self.make_simple_inventory() self.assertExpand( [ b"TREE_ROOT", b"dir1-id", 
b"sub-dir1-id", b"sub-file1-id", b"sub-file2-id", b"subsub-file1-id", ], inv, [b"dir1-id"], ) def test_from_root(self): inv = self.make_simple_inventory() self.assertExpand( [ b"TREE_ROOT", b"dir1-id", b"dir2-id", b"sub-dir1-id", b"sub-file1-id", b"sub-file2-id", b"sub2-file1-id", b"subsub-file1-id", b"top-id", ], inv, [b"TREE_ROOT"], ) def test_top_level_file(self): inv = self.make_simple_inventory() self.assertExpand([b"TREE_ROOT", b"top-id"], inv, [b"top-id"]) def test_subsub_file(self): inv = self.make_simple_inventory() self.assertExpand( [b"TREE_ROOT", b"dir1-id", b"sub-dir1-id", b"subsub-file1-id"], inv, [b"subsub-file1-id"], ) def test_sub_and_root(self): inv = self.make_simple_inventory() self.assertExpand( [b"TREE_ROOT", b"dir1-id", b"sub-dir1-id", b"top-id", b"subsub-file1-id"], inv, [b"top-id", b"subsub-file1-id"], ) class ErrorTests(TestCase): def test_duplicate_file_id(self): error = DuplicateFileId("a_file_id", "foo") self.assertEqualDiff( "File id {a_file_id} already exists in inventory as foo", str(error) ) bzrformats_3.4.0.orig/bzrformats/tests/test_inventory_delta.py0000644000000000000000000007226115162115107022037 0ustar00# Copyright (C) 2009, 2010, 2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for bzrformats.inventory_delta. 
See doc/developer/inventory.txt for more information. """ from io import BytesIO from bzrformats import inventory, inventory_delta from bzrformats.inventory import Inventory, _make_delta from bzrformats.inventory_delta import InventoryDelta, InventoryDeltaError from .. import osutils from ..revision import NULL_REVISION from . import TestCase ### DO NOT REFLOW THESE TEXTS. NEW LINES ARE SIGNIFICANT. ### empty_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: null: versioned_root: true tree_references: true """ root_only_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: entry-version versioned_root: true tree_references: true None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir """ root_change_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: entry-version version: changed-root versioned_root: true tree_references: true /\x00an-id\x00\x00different-version\x00dir """ corrupt_parent_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: entry-version version: changed-root versioned_root: false tree_references: false /\x00an-id\x00\x00different-version\x00dir """ root_only_unversioned = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: entry-version versioned_root: false tree_references: false None\x00/\x00TREE_ROOT\x00\x00entry-version\x00dir """ reference_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: entry-version versioned_root: true tree_references: true None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir None\x00/foo\x00id\x00TREE_ROOT\x00changed\x00tree\x00subtree-version """ change_tree_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: entry-version version: change-tree versioned_root: false tree_references: false /foo\x00id\x00TREE_ROOT\x00changed-twice\x00tree\x00subtree-version2 """ class TestDeserialization(TestCase): """Test InventoryDeltaDeserializer.parse_text_bytes.""" def test_parse_no_bytes(self): 
"""Test that parsing an empty bytes list raises an error.""" deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises(InventoryDeltaError, deserializer.parse_text_bytes, []) self.assertContainsRe(str(err), "inventory delta is empty") def test_parse_bad_format(self): """Test that an unknown format string raises an error.""" deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, [b"format: foo\n"] ) self.assertContainsRe(str(err), "unknown format") def test_parse_no_parent(self): """Test that a missing parent marker raises an error.""" deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, [b"format: bzr inventory delta v1 (bzr 1.14)\n"], ) self.assertContainsRe(str(err), "missing parent: marker") def test_parse_no_version(self): deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, [b"format: bzr inventory delta v1 (bzr 1.14)\n", b"parent: null:\n"], ) self.assertContainsRe(str(err), "missing version: marker") def test_parse_duplicate_key_errors(self): deserializer = inventory_delta.InventoryDeltaDeserializer() double_root_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: null: versioned_root: true tree_references: true None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00 None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00 """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(double_root_lines), ) self.assertContainsRe(str(err), "duplicate file id") def test_parse_versioned_root_only(self): deserializer = inventory_delta.InventoryDeltaDeserializer() parse_result = deserializer.parse_text_bytes( osutils.split_lines(root_only_lines) ) expected_entry = inventory.make_entry( "directory", "", None, 
b"an-id", revision=b"a@e\xc3\xa5ample.com--2004" ) self.assertEqual( ( b"null:", b"entry-version", True, True, InventoryDelta([(None, "", b"an-id", expected_entry)]), ), parse_result, ) def test_parse_special_revid_not_valid_last_mod(self): deserializer = inventory_delta.InventoryDeltaDeserializer() root_only_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: null: versioned_root: false tree_references: true None\x00/\x00TREE_ROOT\x00\x00null:\x00dir\x00\x00 """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(root_only_lines), ) self.assertContainsRe(str(err), "special revisionid found") def test_parse_versioned_root_versioned_disabled(self): deserializer = inventory_delta.InventoryDeltaDeserializer() root_only_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: null: versioned_root: false tree_references: true None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00 """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(root_only_lines), ) self.assertContainsRe(str(err), "Versioned root found") def test_parse_unique_root_id_root_versioned_disabled(self): deserializer = inventory_delta.InventoryDeltaDeserializer() root_only_lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: parent-id version: a@e\xc3\xa5ample.com--2004 versioned_root: false tree_references: true None\x00/\x00an-id\x00\x00parent-id\x00dir\x00\x00 """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(root_only_lines), ) self.assertContainsRe(str(err), "Versioned root found") def test_parse_unversioned_root_versioning_enabled(self): deserializer = inventory_delta.InventoryDeltaDeserializer() parse_result = deserializer.parse_text_bytes( osutils.split_lines(root_only_unversioned) ) expected_entry = inventory.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"entry-version" ) 
self.assertEqual( ( b"null:", b"entry-version", False, False, InventoryDelta([(None, "", b"TREE_ROOT", expected_entry)]), ), parse_result, ) def test_parse_versioned_root_when_disabled(self): deserializer = inventory_delta.InventoryDeltaDeserializer( allow_versioned_root=False ) err = self.assertRaises( inventory_delta.IncompatibleInventoryDelta, deserializer.parse_text_bytes, osutils.split_lines(root_only_lines), ) self.assertEqual("versioned_root not allowed", str(err)) def test_parse_tree_when_disabled(self): deserializer = inventory_delta.InventoryDeltaDeserializer( allow_tree_references=False ) err = self.assertRaises( inventory_delta.IncompatibleInventoryDelta, deserializer.parse_text_bytes, osutils.split_lines(reference_lines), ) self.assertEqual("Tree reference not allowed", str(err)) def test_parse_tree_when_header_disallows(self): # A deserializer that allows tree_references to be set or unset. deserializer = inventory_delta.InventoryDeltaDeserializer() # A serialised inventory delta with a header saying no tree refs, but # that has a tree ref in its content. lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: entry-version versioned_root: false tree_references: false None\x00/foo\x00id\x00TREE_ROOT\x00changed\x00tree\x00subtree-version """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(lines), ) self.assertContainsRe(str(err), "Tree reference found") def test_parse_versioned_root_when_header_disallows(self): # A deserializer that allows versioned_root to be set or unset. deserializer = inventory_delta.InventoryDeltaDeserializer() # A serialised inventory delta with a header saying no versioned root, but # that has a versioned root in its content. 
lines = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: entry-version versioned_root: false tree_references: false None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir """ err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(lines), ) self.assertContainsRe(str(err), "Versioned root found") def test_parse_last_line_not_empty(self): """A delta whose last line lacks a trailing newline is rejected.""" # Trim the trailing newline from a valid serialization lines = root_only_lines[:-1] deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(lines), ) self.assertContainsRe(str(err), "last line not empty") def test_parse_invalid_newpath(self): """Newpath must start with / if it is not None.""" lines = empty_lines lines += b"None\x00bad\x00TREE_ROOT\x00\x00version\x00dir\n" deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(lines), ) self.assertContainsRe(str(err), "newpath invalid") def test_parse_invalid_oldpath(self): """Oldpath must start with / if it is not None.""" lines = root_only_lines lines += b"bad\x00/new\x00file-id\x00\x00version\x00dir\n" deserializer = inventory_delta.InventoryDeltaDeserializer() err = self.assertRaises( InventoryDeltaError, deserializer.parse_text_bytes, osutils.split_lines(lines), ) self.assertContainsRe(str(err), "oldpath invalid") def test_parse_new_file(self): """A new file is parsed correctly.""" lines = root_only_lines fake_sha = b"deadbeef" * 5 lines += ( b"None\x00/new\x00file-id\x00an-id\x00version\x00file\x00123\x00" + b"\x00" + fake_sha + b"\n" ) deserializer = inventory_delta.InventoryDeltaDeserializer() parse_result = deserializer.parse_text_bytes(osutils.split_lines(lines)) expected_entry = inventory.make_entry( "file", "new", b"an-id", b"file-id", 
revision=b"version", text_size=123, text_sha1=fake_sha, ) delta = parse_result[4] self.assertEqual((None, "new", b"file-id", expected_entry), delta[-1]) def test_parse_delete(self): lines = root_only_lines lines += b"/old-file\x00None\x00deleted-id\x00\x00null:\x00deleted\x00\x00\n" deserializer = inventory_delta.InventoryDeltaDeserializer() parse_result = deserializer.parse_text_bytes(osutils.split_lines(lines)) delta = parse_result[4] self.assertEqual(("old-file", None, b"deleted-id", None), delta[-1]) class TestSerialization(TestCase): """Tests for InventoryDeltaSerializer.delta_to_lines.""" def test_empty_delta_to_lines(self): old_inv = Inventory(None) new_inv = Inventory(None) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) self.assertEqual( BytesIO(empty_lines).readlines(), serializer.delta_to_lines(NULL_REVISION, NULL_REVISION, delta), ) def test_root_only_to_lines(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry( "directory", "", None, b"an-id", revision=b"a@e\xc3\xa5ample.com--2004" ) new_inv.add(root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) self.assertEqual( BytesIO(root_only_lines).readlines(), serializer.delta_to_lines(NULL_REVISION, b"entry-version", delta), ) def test_unversioned_root(self): old_inv = Inventory(None) new_inv = Inventory(None) # Implicit roots are considered modified in every revision. 
root = new_inv.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"entry-version" ) new_inv.add(root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=False, tree_references=False ) serialized_lines = serializer.delta_to_lines( NULL_REVISION, b"entry-version", delta ) self.assertEqual(BytesIO(root_only_unversioned).readlines(), serialized_lines) deserializer = inventory_delta.InventoryDeltaDeserializer() self.assertEqual( (NULL_REVISION, b"entry-version", False, False, delta), deserializer.parse_text_bytes(serialized_lines), ) def test_unversioned_non_root_errors(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"a@e\xc3\xa5ample.com--2004" ) new_inv.add(root) non_root = new_inv.make_entry("directory", "foo", root.file_id, b"id") new_inv.add(non_root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) err = self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, NULL_REVISION, b"entry-version", delta, ) self.assertContainsRe(str(err), "^no version for fileid id$") def test_richroot_unversioned_root_errors(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry("directory", "", None, b"TREE_ROOT") new_inv.add(root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) err = self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, NULL_REVISION, b"entry-version", delta, ) self.assertContainsRe(str(err), "no version for fileid TREE_ROOT$") def test_nonrichroot_versioned_root_errors(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"a@e\xc3\xa5ample.com--2004" ) new_inv.add(root) delta = _make_delta(new_inv, old_inv) 
serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=False, tree_references=True ) err = self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, NULL_REVISION, b"entry-version", delta, ) self.assertContainsRe(str(err), "^Version present for / in TREE_ROOT") def test_tree_reference_disabled(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"a@e\xc3\xa5ample.com--2004" ) new_inv.add(root) non_root = new_inv.make_entry( "tree-reference", "foo", root.file_id, b"id", revision=b"changed", reference_revision=b"subtree-version", ) new_inv.add(non_root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=False ) # we expect keyerror because there is little value wrapping this. # This test aims to prove that it errors more than how it errors. err = self.assertRaises( KeyError, serializer.delta_to_lines, NULL_REVISION, b"entry-version", delta ) self.assertEqual(("tree-reference",), err.args) def test_tree_reference_enabled(self): old_inv = Inventory(None) new_inv = Inventory(None) root = new_inv.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"a@e\xc3\xa5ample.com--2004" ) new_inv.add(root) non_root = new_inv.make_entry( "tree-reference", "foo", root.file_id, b"id", revision=b"changed", reference_revision=b"subtree-version", ) new_inv.add(non_root) delta = _make_delta(new_inv, old_inv) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) self.assertEqual( BytesIO(reference_lines).readlines(), serializer.delta_to_lines(NULL_REVISION, b"entry-version", delta), ) def test_to_inventory_root_id_versioned_not_permitted(self): root_entry = inventory.make_entry( "directory", "", None, b"TREE_ROOT", revision=b"some-version" ) delta = InventoryDelta([(None, "", b"TREE_ROOT", root_entry)]) serializer = inventory_delta.InventoryDeltaSerializer( 
versioned_root=False, tree_references=True ) self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, b"old-version", b"new-version", delta, ) def test_to_inventory_root_id_not_versioned(self): delta = InventoryDelta( [ ( None, "", b"an-id", inventory.make_entry("directory", "", None, b"an-id"), ) ] ) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, b"old-version", b"new-version", delta, ) def test_to_inventory_has_tree_not_meant_to(self): make_entry = inventory.make_entry tree_ref = make_entry( "tree-reference", "foo", b"changed-in", b"ref-id", reference_revision=b"ref-revision", ) delta = InventoryDelta( [ ( None, "", b"an-id", make_entry("directory", "", b"changed-in", b"an-id"), ), (None, "foo", b"ref-id", tree_ref), # a file that followed the root move ] ) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) self.assertRaises( InventoryDeltaError, serializer.delta_to_lines, b"old-version", b"new-version", delta, ) def test_to_inventory_torture(self): def make_entry(kind, name, parent_id, file_id, **attrs): return inventory.make_entry(kind, name, parent_id, file_id, **attrs) # this delta is crafted to have all the following: # - deletes # - renamed roots # - deep dirs # - files moved after parent dir was renamed # - files with and without exec bit delta = InventoryDelta( [ # new root: ( None, "", b"new-root-id", make_entry( "directory", "", None, b"new-root-id", revision=b"changed-in" ), ), # an old root: ( "", "old-root", b"TREE_ROOT", make_entry( "directory", "subdir-now", b"new-root-id", b"TREE_ROOT", revision=b"moved-root", ), ), # a file that followed the root move ( "under-old-root", "old-root/under-old-root", b"moved-id", make_entry( "file", "under-old-root", b"TREE_ROOT", b"moved-id", revision=b"old-rev", executable=False, text_size=30, text_sha1=b"some-sha", ), ), # a deleted path 
("old-file", None, b"deleted-id", None), # a tree reference moved to the new root ( "ref", "ref", b"ref-id", make_entry( "tree-reference", "ref", b"new-root-id", b"ref-id", reference_revision=b"tree-reference-id", revision=b"new-rev", ), ), # a symlink now in a deep dir ( "dir/link", "old-root/dir/link", b"link-id", make_entry( "symlink", "link", b"deep-id", b"link-id", symlink_target="target", revision=b"new-rev", ), ), # a deep dir ( "dir", "old-root/dir", b"deep-id", make_entry( "directory", "dir", b"TREE_ROOT", b"deep-id", revision=b"new-rev", ), ), # a file with an exec bit set ( None, "configure", b"exec-id", make_entry( "file", "configure", b"new-root-id", b"exec-id", executable=True, text_size=30, text_sha1=b"some-sha", revision=b"old-rev", ), ), ] ) serializer = inventory_delta.InventoryDeltaSerializer( versioned_root=True, tree_references=True ) lines = serializer.delta_to_lines(NULL_REVISION, b"something", delta) expected = b"""format: bzr inventory delta v1 (bzr 1.14) parent: null: version: something versioned_root: true tree_references: true /\x00/old-root\x00TREE_ROOT\x00new-root-id\x00moved-root\x00dir /dir\x00/old-root/dir\x00deep-id\x00TREE_ROOT\x00new-rev\x00dir /dir/link\x00/old-root/dir/link\x00link-id\x00deep-id\x00new-rev\x00link\x00target /old-file\x00None\x00deleted-id\x00\x00null:\x00deleted\x00\x00 /ref\x00/ref\x00ref-id\x00new-root-id\x00new-rev\x00tree\x00tree-reference-id /under-old-root\x00/old-root/under-old-root\x00moved-id\x00TREE_ROOT\x00old-rev\x00file\x0030\x00\x00some-sha None\x00/\x00new-root-id\x00\x00changed-in\x00dir None\x00/configure\x00exec-id\x00new-root-id\x00old-rev\x00file\x0030\x00Y\x00some-sha """ serialized = b"".join(lines) self.assertIsInstance(serialized, bytes) self.assertEqual(expected, serialized) class TestContent(TestCase): """Test serialization of the content part of a line.""" def test_dir(self): entry = inventory.make_entry("directory", "a dir", b"parent") self.assertEqual(b"dir", 
inventory_delta.serialize_inventory_entry(entry)) def test_file_0_short_sha(self): file_entry = inventory.make_entry( "file", "a file", b"parent", b"file-id", text_sha1=b"", text_size=0 ) self.assertEqual( b"file\x000\x00\x00", inventory_delta.serialize_inventory_entry(file_entry) ) def test_file_10_foo(self): file_entry = inventory.make_entry( "file", "a file", b"parent", b"file-id", text_sha1=b"foo", text_size=10 ) self.assertEqual( b"file\x0010\x00\x00foo", inventory_delta.serialize_inventory_entry(file_entry), ) def test_file_executable(self): file_entry = inventory.make_entry( "file", "a file", b"parent", b"file-id", executable=True, text_sha1=b"foo", text_size=10, ) self.assertEqual( b"file\x0010\x00Y\x00foo", inventory_delta.serialize_inventory_entry(file_entry), ) def test_file_without_size(self): file_entry = inventory.make_entry( "file", "a file", b"parent", b"file-id", text_sha1=b"foo" ) self.assertRaises( InventoryDeltaError, inventory_delta.serialize_inventory_entry, file_entry ) def test_file_without_sha1(self): file_entry = inventory.make_entry( "file", "a file", b"parent", b"file-id", text_size=10 ) self.assertRaises( InventoryDeltaError, inventory_delta.serialize_inventory_entry, file_entry ) def test_link_empty_target(self): entry = inventory.make_entry("symlink", "a link", b"parent", symlink_target="") self.assertEqual(b"link\x00", inventory_delta.serialize_inventory_entry(entry)) def test_link_unicode_target(self): entry = inventory.make_entry( "symlink", "a link", b"parent", symlink_target=b" \xc3\xa5".decode("utf8") ) self.assertEqual( b"link\x00 \xc3\xa5", inventory_delta.serialize_inventory_entry(entry) ) def test_link_space_target(self): entry = inventory.make_entry("symlink", "a link", b"parent", symlink_target=" ") self.assertEqual(b"link\x00 ", inventory_delta.serialize_inventory_entry(entry)) def test_link_no_target(self): entry = inventory.make_entry("symlink", "a link", b"parent") self.assertRaises( InventoryDeltaError, 
inventory_delta.serialize_inventory_entry, entry ) def test_reference_null(self): entry = inventory.make_entry( "tree-reference", "a tree", b"parent", reference_revision=NULL_REVISION ) self.assertEqual( b"tree\x00null:", inventory_delta.serialize_inventory_entry(entry) ) def test_reference_revision(self): entry = inventory.make_entry( "tree-reference", "a tree", b"parent", reference_revision=b"foo@\xc3\xa5b-lah", ) self.assertEqual( b"tree\x00foo@\xc3\xa5b-lah", inventory_delta.serialize_inventory_entry(entry), ) def test_reference_no_reference(self): entry = inventory.make_entry("tree-reference", "a tree", b"parent") self.assertRaises( InventoryDeltaError, inventory_delta.serialize_inventory_entry, entry ) bzrformats_3.4.0.orig/bzrformats/tests/test_knit.py0000644000000000000000000030240015162115103017561 0ustar00# Copyright (C) 2006-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for Knit data structure.""" import gzip from io import BytesIO from patiencediff import PatienceSequenceMatcher from bzrformats import osutils from bzrformats.errors import ReadOnlyError from .. 
import knit, multiparent, pack_repo from ..index import * # noqa: F403 from ..knit import ( AnnotatedKnitContent, KnitContent, KnitCorrupt, KnitDataStreamIncompatible, KnitDataStreamUnknown, KnitHeaderError, KnitIndexUnknownMethod, KnitVersionedFiles, PlainKnitContent, _KndxIndex, _KnitGraphIndex, _KnitKeyAccess, _VFContentMapGenerator, make_file_factory, ) from ..transport import NoSuchFile as _NoSuchFile from ..versionedfile import ( AbsentContentFactory, ConstantMapper, RecordingVersionedFilesDecorator, network_bytes_to_kind_and_offset, ) from . import ( TestCase, TestCaseWithMemoryTransport, TestNotApplicable, _try_import, ) _compiled_knit_module = _try_import("bzrformats._knit_load_data_pyx") class ErrorTests(TestCase): def test_knit_data_stream_incompatible(self): error = KnitDataStreamIncompatible("stream format", "target format") self.assertEqual( "Cannot insert knit data stream of format " '"stream format" into knit of format ' '"target format".', str(error), ) def test_knit_data_stream_unknown(self): error = KnitDataStreamUnknown("stream format") self.assertEqual( 'Cannot parse knit data stream of format "stream format".', str(error) ) def test_knit_header_error(self): error = KnitHeaderError("line foo\n", "path/to/file") self.assertEqual( "Knit header error: 'line foo\\n' unexpected for file \"path/to/file\".", str(error), ) def test_knit_index_unknown_method(self): error = KnitIndexUnknownMethod("http://host/foo.kndx", ["bad", "no-eol"]) self.assertEqual( "Knit index http://host/foo.kndx does not have a" " known method in options: ['bad', 'no-eol']", str(error), ) class KnitContentTestsMixin: def test_constructor(self): self._make_content([]) def test_text(self): content = self._make_content([]) self.assertEqual(content.text(), []) content = self._make_content([(b"origin1", b"text1"), (b"origin2", b"text2")]) self.assertEqual(content.text(), [b"text1", b"text2"]) def test_copy(self): content = self._make_content([(b"origin1", b"text1"), (b"origin2", 
b"text2")]) copy = content.copy() self.assertIsInstance(copy, content.__class__) self.assertEqual(copy.annotate(), content.annotate()) def assertDerivedBlocksEqual(self, source, target, noeol=False): """Assert that the derived matching blocks match real output.""" source_lines = source.splitlines(True) target_lines = target.splitlines(True) def nl(line): if noeol and not line.endswith("\n"): return line + "\n" else: return line source_content = self._make_content([(None, nl(l)) for l in source_lines]) target_content = self._make_content([(None, nl(l)) for l in target_lines]) line_delta = source_content.line_delta(target_content) delta_blocks = list( KnitContent.get_line_delta_blocks(line_delta, source_lines, target_lines) ) matcher = PatienceSequenceMatcher(None, source_lines, target_lines) matcher_blocks = list(matcher.get_matching_blocks()) self.assertEqual(matcher_blocks, delta_blocks) def test_get_line_delta_blocks(self): self.assertDerivedBlocksEqual("a\nb\nc\n", "q\nc\n") self.assertDerivedBlocksEqual(TEXT_1, TEXT_1) self.assertDerivedBlocksEqual(TEXT_1, TEXT_1A) self.assertDerivedBlocksEqual(TEXT_1, TEXT_1B) self.assertDerivedBlocksEqual(TEXT_1B, TEXT_1A) self.assertDerivedBlocksEqual(TEXT_1A, TEXT_1B) self.assertDerivedBlocksEqual(TEXT_1A, "") self.assertDerivedBlocksEqual("", TEXT_1A) self.assertDerivedBlocksEqual("", "") self.assertDerivedBlocksEqual("a\nb\nc", "a\nb\nc\nd") def test_get_line_delta_blocks_noeol(self): """Handle historical knit deltas safely. Some existing knit deltas don't consider the last line to differ when the only difference is whether it has a final newline. New knit deltas appear to always consider the last line to differ in this case. 
""" self.assertDerivedBlocksEqual("a\nb\nc", "a\nb\nc\nd\n", noeol=True) self.assertDerivedBlocksEqual("a\nb\nc\nd\n", "a\nb\nc", noeol=True) self.assertDerivedBlocksEqual("a\nb\nc\n", "a\nb\nc", noeol=True) self.assertDerivedBlocksEqual("a\nb\nc", "a\nb\nc\n", noeol=True) TEXT_1 = """\ Banana cup cakes: - bananas - eggs - broken tea cups """ TEXT_1A = """\ Banana cup cake recipe (serves 6) - bananas - eggs - broken tea cups - self-raising flour """ TEXT_1B = """\ Banana cup cake recipe - bananas (do not use plantains!!!) - broken tea cups - flour """ delta_1_1a = """\ 0,1,2 Banana cup cake recipe (serves 6) 5,5,1 - self-raising flour """ TEXT_2 = """\ Boeuf bourguignon - beef - red wine - small onions - carrot - mushrooms """ class TestPlainKnitContent(TestCase, KnitContentTestsMixin): def _make_content(self, lines): annotated_content = AnnotatedKnitContent(lines) return PlainKnitContent(annotated_content.text(), "bogus") def test_annotate(self): content = self._make_content([]) self.assertEqual(content.annotate(), []) content = self._make_content([("origin1", "text1"), ("origin2", "text2")]) self.assertEqual(content.annotate(), [("bogus", "text1"), ("bogus", "text2")]) def test_line_delta(self): content1 = self._make_content([("", "a"), ("", "b")]) content2 = self._make_content([("", "a"), ("", "a"), ("", "c")]) self.assertEqual(content1.line_delta(content2), [(1, 2, 2, ["a", "c"])]) def test_line_delta_iter(self): content1 = self._make_content([("", "a"), ("", "b")]) content2 = self._make_content([("", "a"), ("", "a"), ("", "c")]) it = content1.line_delta_iter(content2) self.assertEqual(next(it), (1, 2, 2, ["a", "c"])) self.assertRaises(StopIteration, next, it) class TestAnnotatedKnitContent(TestCase, KnitContentTestsMixin): def _make_content(self, lines): return AnnotatedKnitContent(lines) def test_annotate(self): content = self._make_content([]) self.assertEqual(content.annotate(), []) content = self._make_content([(b"origin1", b"text1"), (b"origin2", 
b"text2")]) self.assertEqual( content.annotate(), [(b"origin1", b"text1"), (b"origin2", b"text2")] ) def test_line_delta(self): content1 = self._make_content([("", "a"), ("", "b")]) content2 = self._make_content([("", "a"), ("", "a"), ("", "c")]) self.assertEqual( content1.line_delta(content2), [(1, 2, 2, [("", "a"), ("", "c")])] ) def test_line_delta_iter(self): content1 = self._make_content([("", "a"), ("", "b")]) content2 = self._make_content([("", "a"), ("", "a"), ("", "c")]) it = content1.line_delta_iter(content2) self.assertEqual(next(it), (1, 2, 2, [("", "a"), ("", "c")])) self.assertRaises(StopIteration, next, it) class MockTransport: def __init__(self, file_lines=None): self.file_lines = file_lines self.calls = [] # We have no base directory for the MockTransport self.base = "" def get(self, filename): if self.file_lines is None: raise _NoSuchFile(filename) else: return BytesIO(b"\n".join(self.file_lines)) def readv(self, relpath, offsets): fp = self.get(relpath) for offset, size in offsets: fp.seek(offset) yield offset, fp.read(size) def __getattr__(self, name): def queue_call(*args, **kwargs): self.calls.append((name, args, kwargs)) return queue_call class MockReadvFailingTransport(MockTransport): """Fail in the middle of a readv() result. This Transport will successfully yield the first two requested hunks, but raise NoSuchFile for the rest. 
""" def readv(self, relpath, offsets): for count, result in enumerate(MockTransport.readv(self, relpath, offsets), 1): # we use 2 because the first offset is the pack header, the second # is the first actual content request if count > 2: raise _NoSuchFile(relpath) yield result class KnitRecordAccessTestsMixin: """Tests for getting and putting knit records.""" def test_add_raw_records(self): """add_raw_records adds records retrievable later.""" access = self.get_access() memos = access.add_raw_records([(b"key", 10)], [b"1234567890"]) self.assertEqual([b"1234567890"], list(access.get_raw_records(memos))) def test_add_raw_record(self): """add_raw_record adds records retrievable later.""" access = self.get_access() memos = access.add_raw_record(b"key", 10, [b"1234567890"]) self.assertEqual([b"1234567890"], list(access.get_raw_records([memos]))) def test_add_several_raw_records(self): """add_raw_records with many records and read some back.""" access = self.get_access() memos = access.add_raw_records( [(b"key", 10), (b"key2", 2), (b"key3", 5)], [b"12345678901234567"] ) self.assertEqual( [b"1234567890", b"12", b"34567"], list(access.get_raw_records(memos)) ) self.assertEqual([b"1234567890"], list(access.get_raw_records(memos[0:1]))) self.assertEqual([b"12"], list(access.get_raw_records(memos[1:2]))) self.assertEqual([b"34567"], list(access.get_raw_records(memos[2:3]))) self.assertEqual( [b"1234567890", b"34567"], list(access.get_raw_records(memos[0:1] + memos[2:3])), ) class TestKnitKnitAccess(TestCaseWithMemoryTransport, KnitRecordAccessTestsMixin): """Tests for the .kndx implementation.""" def get_access(self): """Get a .knit style access instance.""" mapper = ConstantMapper("foo") access = _KnitKeyAccess(self.get_transport(), mapper) return access class LowLevelKnitDataTests(TestCase): def create_gz_content(self, text): sio = BytesIO() with gzip.GzipFile(mode="wb", fileobj=sio) as gz_file: gz_file.write(text) return sio.getvalue() def make_multiple_records(self): 
"""Create the content for multiple records.""" sha1sum = osutils.sha_string(b"foo\nbar\n") total_txt = [] gz_txt = self.create_gz_content( b"version rev-id-1 2 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) ) record_1 = (0, len(gz_txt), sha1sum) total_txt.append(gz_txt) sha1sum = osutils.sha_string(b"baz\n") gz_txt = self.create_gz_content( b"version rev-id-2 1 %s\nbaz\nend rev-id-2\n" % (sha1sum,) ) record_2 = (record_1[1], len(gz_txt), sha1sum) total_txt.append(gz_txt) return total_txt, record_1, record_2 def test_valid_knit_data(self): sha1sum = osutils.sha_string(b"foo\nbar\n") gz_txt = self.create_gz_content( b"version rev-id-1 2 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) ) transport = MockTransport([gz_txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [((b"rev-id-1",), ((b"rev-id-1",), 0, len(gz_txt)))] contents = list(knit._read_records_iter(records)) self.assertEqual( [ ( (b"rev-id-1",), [b"foo\n", b"bar\n"], b"4e48e2c9a3d2ca8a708cb0cc545700544efb5021", ) ], contents, ) raw_contents = list(knit._read_records_iter_raw(records)) self.assertEqual([((b"rev-id-1",), gz_txt, sha1sum)], raw_contents) def test_multiple_records_valid(self): total_txt, record_1, record_2 = self.make_multiple_records() transport = MockTransport([b"".join(total_txt)]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [ ((b"rev-id-1",), ((b"rev-id-1",), record_1[0], record_1[1])), ((b"rev-id-2",), ((b"rev-id-2",), record_2[0], record_2[1])), ] contents = list(knit._read_records_iter(records)) self.assertEqual( [ ((b"rev-id-1",), [b"foo\n", b"bar\n"], record_1[2]), ((b"rev-id-2",), [b"baz\n"], record_2[2]), ], contents, ) raw_contents = list(knit._read_records_iter_raw(records)) self.assertEqual( [ ((b"rev-id-1",), total_txt[0], record_1[2]), ((b"rev-id-2",), total_txt[1], record_2[2]), ], raw_contents, ) def test_not_enough_lines(self): sha1sum = 
osutils.sha_string(b"foo\n") # record says 2 lines data says 1 gz_txt = self.create_gz_content( b"version rev-id-1 2 %s\nfoo\nend rev-id-1\n" % (sha1sum,) ) transport = MockTransport([gz_txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [((b"rev-id-1",), ((b"rev-id-1",), 0, len(gz_txt)))] self.assertRaises(KnitCorrupt, list, knit._read_records_iter(records)) # read_records_iter_raw won't detect that sort of mismatch/corruption raw_contents = list(knit._read_records_iter_raw(records)) self.assertEqual([((b"rev-id-1",), gz_txt, sha1sum)], raw_contents) def test_too_many_lines(self): sha1sum = osutils.sha_string(b"foo\nbar\n") # record says 1 lines data says 2 gz_txt = self.create_gz_content( b"version rev-id-1 1 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) ) transport = MockTransport([gz_txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [((b"rev-id-1",), ((b"rev-id-1",), 0, len(gz_txt)))] self.assertRaises(KnitCorrupt, list, knit._read_records_iter(records)) # read_records_iter_raw won't detect that sort of mismatch/corruption raw_contents = list(knit._read_records_iter_raw(records)) self.assertEqual([((b"rev-id-1",), gz_txt, sha1sum)], raw_contents) def test_mismatched_version_id(self): sha1sum = osutils.sha_string(b"foo\nbar\n") gz_txt = self.create_gz_content( b"version rev-id-1 2 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) ) transport = MockTransport([gz_txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) # We are asking for rev-id-2, but the data is rev-id-1 records = [((b"rev-id-2",), ((b"rev-id-2",), 0, len(gz_txt)))] self.assertRaises(KnitCorrupt, list, knit._read_records_iter(records)) # read_records_iter_raw detects mismatches in the header self.assertRaises(KnitCorrupt, list, knit._read_records_iter_raw(records)) def test_uncompressed_data(self): sha1sum = 
osutils.sha_string(b"foo\nbar\n") txt = b"version rev-id-1 2 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) transport = MockTransport([txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [((b"rev-id-1",), ((b"rev-id-1",), 0, len(txt)))] # We don't have valid gzip data ==> corrupt self.assertRaises(KnitCorrupt, list, knit._read_records_iter(records)) # read_records_iter_raw will notice the bad data self.assertRaises(KnitCorrupt, list, knit._read_records_iter_raw(records)) def test_corrupted_data(self): sha1sum = osutils.sha_string(b"foo\nbar\n") gz_txt = self.create_gz_content( b"version rev-id-1 2 %s\nfoo\nbar\nend rev-id-1\n" % (sha1sum,) ) # Change 2 bytes in the middle to \xff gz_txt = gz_txt[:10] + b"\xff\xff" + gz_txt[12:] transport = MockTransport([gz_txt]) access = _KnitKeyAccess(transport, ConstantMapper("filename")) knit = KnitVersionedFiles(None, access) records = [((b"rev-id-1",), ((b"rev-id-1",), 0, len(gz_txt)))] self.assertRaises(KnitCorrupt, list, knit._read_records_iter(records)) # read_records_iter_raw will barf on bad gz data self.assertRaises(KnitCorrupt, list, knit._read_records_iter_raw(records)) class LowLevelKnitIndexTests(TestCase): @property def _load_data(self): from .._knit_load_data_py import _load_data_py return _load_data_py def get_knit_index(self, transport, name, mode): mapper = ConstantMapper(name) self.overrideAttr(knit, "_load_data", self._load_data) def allow_writes(): return "w" in mode return _KndxIndex(transport, mapper, lambda: None, allow_writes, lambda: True) def test_create_file(self): transport = MockTransport() index = self.get_knit_index(transport, "filename", "w") index.keys() call = transport.calls.pop(0) # call[1][1] is a BytesIO - we can't test it by simple equality. 
self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual(_KndxIndex.HEADER, call[1][1].getvalue()) self.assertEqual({"create_parent_dir": True}, call[2]) def test_read_utf8_version_id(self): unicode_revision_id = "version-\N{CYRILLIC CAPITAL LETTER A}" utf8_revision_id = unicode_revision_id.encode("utf-8") transport = MockTransport( [_KndxIndex.HEADER, b"%s option 0 1 :" % (utf8_revision_id,)] ) index = self.get_knit_index(transport, "filename", "r") # _KndxIndex is a private class, and deals in utf8 revision_ids, not # Unicode revision_ids. self.assertEqual({(utf8_revision_id,): ()}, index.get_parent_map(index.keys())) self.assertNotIn((unicode_revision_id,), index.keys()) def test_read_utf8_parents(self): unicode_revision_id = "version-\N{CYRILLIC CAPITAL LETTER A}" utf8_revision_id = unicode_revision_id.encode("utf-8") transport = MockTransport( [_KndxIndex.HEADER, b"version option 0 1 .%s :" % (utf8_revision_id,)] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual( {(b"version",): ((utf8_revision_id,),)}, index.get_parent_map(index.keys()) ) def test_read_ignore_corrupted_lines(self): transport = MockTransport( [ _KndxIndex.HEADER, b"corrupted", b"corrupted options 0 1 .b .c ", b"version options 0 1 :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual(1, len(index.keys())) self.assertEqual({(b"version",)}, index.keys()) def test_read_corrupted_header(self): transport = MockTransport([b"not a bzr knit index header\n"]) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitHeaderError, index.keys) def test_read_duplicate_entries(self): transport = MockTransport( [ _KndxIndex.HEADER, b"parent options 0 1 :", b"version options1 0 1 0 :", b"version options2 1 2 .other :", b"version options3 3 4 0 .other :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual(2, 
len(index.keys())) # check that the index used is the first one written. (Specific # to KnitIndex style indices.) self.assertEqual(b"1", index._dictionary_compress([(b"version",)])) self.assertEqual(((b"version",), 3, 4), index.get_position((b"version",))) self.assertEqual([b"options3"], index.get_options((b"version",))) self.assertEqual( {(b"version",): ((b"parent",), (b"other",))}, index.get_parent_map([(b"version",)]), ) def test_read_compressed_parents(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 1 :", b"b option 0 1 0 :", b"c option 0 1 1 0 :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual( {(b"b",): ((b"a",),), (b"c",): ((b"b",), (b"a",))}, index.get_parent_map([(b"b",), (b"c",)]), ) def test_write_utf8_version_id(self): unicode_revision_id = "version-\N{CYRILLIC CAPITAL LETTER A}" utf8_revision_id = unicode_revision_id.encode("utf-8") transport = MockTransport([_KndxIndex.HEADER]) index = self.get_knit_index(transport, "filename", "r") index.add_records( [((utf8_revision_id,), [b"option"], ((utf8_revision_id,), 0, 1), [])] ) call = transport.calls.pop(0) # call[1][1] is a BytesIO - we can't test it by simple equality. self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual( _KndxIndex.HEADER + b"\n%s option 0 1 :" % (utf8_revision_id,), call[1][1].getvalue(), ) self.assertEqual({"create_parent_dir": True}, call[2]) def test_write_utf8_parents(self): unicode_revision_id = "version-\N{CYRILLIC CAPITAL LETTER A}" utf8_revision_id = unicode_revision_id.encode("utf-8") transport = MockTransport([_KndxIndex.HEADER]) index = self.get_knit_index(transport, "filename", "r") index.add_records( [((b"version",), [b"option"], ((b"version",), 0, 1), [(utf8_revision_id,)])] ) call = transport.calls.pop(0) # call[1][1] is a BytesIO - we can't test it by simple equality.
self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual( _KndxIndex.HEADER + b"\nversion option 0 1 .%s :" % (utf8_revision_id,), call[1][1].getvalue(), ) self.assertEqual({"create_parent_dir": True}, call[2]) def test_keys(self): transport = MockTransport([_KndxIndex.HEADER]) index = self.get_knit_index(transport, "filename", "r") self.assertEqual(set(), index.keys()) index.add_records([((b"a",), [b"option"], ((b"a",), 0, 1), [])]) self.assertEqual({(b"a",)}, index.keys()) index.add_records([((b"a",), [b"option"], ((b"a",), 0, 1), [])]) self.assertEqual({(b"a",)}, index.keys()) index.add_records([((b"b",), [b"option"], ((b"b",), 0, 1), [])]) self.assertEqual({(b"a",), (b"b",)}, index.keys()) def add_a_b(self, index, random_id=None): kwargs = {} if random_id is not None: kwargs["random_id"] = random_id index.add_records( [ ((b"a",), [b"option"], ((b"a",), 0, 1), [(b"b",)]), ((b"a",), [b"opt"], ((b"a",), 1, 2), [(b"c",)]), ((b"b",), [b"option"], ((b"b",), 2, 3), [(b"a",)]), ], **kwargs, ) def assertIndexIsAB(self, index): self.assertEqual( { (b"a",): ((b"c",),), (b"b",): ((b"a",),), }, index.get_parent_map(index.keys()), ) self.assertEqual(((b"a",), 1, 2), index.get_position((b"a",))) self.assertEqual(((b"b",), 2, 3), index.get_position((b"b",))) self.assertEqual([b"opt"], index.get_options((b"a",))) def test_add_versions(self): transport = MockTransport([_KndxIndex.HEADER]) index = self.get_knit_index(transport, "filename", "r") self.add_a_b(index) call = transport.calls.pop(0) # call[1][1] is a BytesIO - we can't test it by simple equality. 
self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual( _KndxIndex.HEADER + b"\na option 0 1 .b :" b"\na opt 1 2 .c :" b"\nb option 2 3 0 :", call[1][1].getvalue(), ) self.assertEqual({"create_parent_dir": True}, call[2]) self.assertIndexIsAB(index) def test_add_versions_random_id_is_accepted(self): transport = MockTransport([_KndxIndex.HEADER]) index = self.get_knit_index(transport, "filename", "r") self.add_a_b(index, random_id=True) def test_delay_create_and_add_versions(self): transport = MockTransport() index = self.get_knit_index(transport, "filename", "w") # dir_mode=0777) self.assertEqual([], transport.calls) self.add_a_b(index) # self.assertEqual( # [ {"dir_mode": 0777, "create_parent_dir": True, "mode": "wb"}, # kwargs) # Two calls: one during which we load the existing index (and when it's # missing, create it), then a second where we write the contents out. self.assertEqual(2, len(transport.calls)) call = transport.calls.pop(0) self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual(_KndxIndex.HEADER, call[1][1].getvalue()) self.assertEqual({"create_parent_dir": True}, call[2]) call = transport.calls.pop(0) # call[1][1] is a BytesIO - we can't test it by simple equality.
self.assertEqual("put_file_non_atomic", call[0]) self.assertEqual("filename.kndx", call[1][0]) # With no history, _KndxIndex writes a new index: self.assertEqual( _KndxIndex.HEADER + b"\na option 0 1 .b :" b"\na opt 1 2 .c :" b"\nb option 2 3 0 :", call[1][1].getvalue(), ) self.assertEqual({"create_parent_dir": True}, call[2]) def assertTotalBuildSize(self, size, keys, positions): self.assertEqual(size, knit._get_total_build_size(None, keys, positions)) def test__get_total_build_size(self): positions = { (b"a",): (("fulltext", False), ((b"a",), 0, 100), None), (b"b",): (("line-delta", False), ((b"b",), 100, 21), (b"a",)), (b"c",): (("line-delta", False), ((b"c",), 121, 35), (b"b",)), (b"d",): (("line-delta", False), ((b"d",), 156, 12), (b"b",)), } self.assertTotalBuildSize(100, [(b"a",)], positions) self.assertTotalBuildSize(121, [(b"b",)], positions) # c needs both a & b self.assertTotalBuildSize(156, [(b"c",)], positions) # we shouldn't count 'b' twice self.assertTotalBuildSize(156, [(b"b",), (b"c",)], positions) self.assertTotalBuildSize(133, [(b"d",)], positions) self.assertTotalBuildSize(168, [(b"c",), (b"d",)], positions) def test_get_position(self): transport = MockTransport( [_KndxIndex.HEADER, b"a option 0 1 :", b"b option 1 2 :"] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual(((b"a",), 0, 1), index.get_position((b"a",))) self.assertEqual(((b"b",), 1, 2), index.get_position((b"b",))) def test_get_method(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a fulltext,unknown 0 1 :", b"b unknown,line-delta 1 2 :", b"c bad 3 4 :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual("fulltext", index.get_method(b"a")) self.assertEqual("line-delta", index.get_method(b"b")) self.assertRaises(knit.KnitIndexUnknownMethod, index.get_method, b"c") def test_get_options(self): transport = MockTransport( [_KndxIndex.HEADER, b"a opt1 0 1 :", b"b opt2,opt3 1 2 :"] ) index = self.get_knit_index(transport, 
"filename", "r") self.assertEqual([b"opt1"], index.get_options(b"a")) self.assertEqual([b"opt2", b"opt3"], index.get_options(b"b")) def test_get_parent_map(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 1 :", b"b option 1 2 0 .c :", b"c option 1 2 1 0 .e :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual( { (b"a",): (), (b"b",): ((b"a",), (b"c",)), (b"c",): ((b"b",), (b"a",), (b"e",)), }, index.get_parent_map(index.keys()), ) def test_impossible_parent(self): """Test we get KnitCorrupt if the parent couldn't possibly exist.""" transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 1 :", b"b option 0 1 4 :", # We don't have a 4th record ] ) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitCorrupt, index.keys) def test_corrupted_parent(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 1 :", b"b option 0 1 :", b"c option 0 1 1v :", # Can't have a parent of '1v' ] ) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitCorrupt, index.keys) def test_corrupted_parent_in_list(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 1 :", b"b option 0 1 :", b"c option 0 1 1 v :", # Can't have a parent of 'v' ] ) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitCorrupt, index.keys) def test_invalid_position(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 1v 1 :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitCorrupt, index.keys) def test_invalid_size(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 1 1v :", ] ) index = self.get_knit_index(transport, "filename", "r") self.assertRaises(KnitCorrupt, index.keys) def test_scan_unvalidated_index_not_implemented(self): transport = MockTransport() index = self.get_knit_index(transport, "filename", "r") self.assertRaises( NotImplementedError, index.scan_unvalidated_index, "dummy graph_index" ) 
self.assertRaises(NotImplementedError, index.get_missing_compression_parents) def test_short_line(self): transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 10 :", b"b option 10 10 0", # This line isn't terminated, ignored ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual({(b"a",)}, index.keys()) def test_skip_incomplete_record(self): # A line with bogus data should just be skipped transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 10 :", b"b option 10 10 0", # This line isn't terminated, ignored b"c option 20 10 0 :", # Properly terminated, and starts with '\n' ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual({(b"a",), (b"c",)}, index.keys()) def test_trailing_characters(self): # A line with bogus data should just be skipped transport = MockTransport( [ _KndxIndex.HEADER, b"a option 0 10 :", b"b option 10 10 0 :a", # This line has extra trailing characters b"c option 20 10 0 :", # Properly terminated, and starts with '\n' ] ) index = self.get_knit_index(transport, "filename", "r") self.assertEqual({(b"a",), (b"c",)}, index.keys()) class LowLevelKnitIndexTests_c(LowLevelKnitIndexTests): def setUp(self): super().setUp() if _compiled_knit_module is None: self.skipTest("bzrformats._knit_load_data_pyx not available") @property def _load_data(self): from .._knit_load_data_pyx import _load_data_c return _load_data_c class Test_KnitAnnotator(TestCaseWithMemoryTransport): def make_annotator(self): factory = knit.make_pack_factory(True, True, 1) vf = factory(self.get_transport()) return knit._KnitAnnotator(vf) def test__expand_fulltext(self): ann = self.make_annotator() rev_key = (b"rev-id",) ann._num_compression_children[rev_key] = 1 res = ann._expand_record( rev_key, ((b"parent-id",),), None, [b"line1\n", b"line2\n"], ("fulltext", True), ) # The content object and text lines should be cached appropriately self.assertEqual([b"line1\n", b"line2"], res) content_obj = ann._content_objects[rev_key] 
self.assertEqual([b"line1\n", b"line2\n"], content_obj._lines) self.assertEqual(res, content_obj.text()) self.assertEqual(res, ann._text_cache[rev_key]) def test__expand_delta_comp_parent_not_available(self): # Parent isn't available yet, so we return nothing, but queue up this # node for later processing ann = self.make_annotator() rev_key = (b"rev-id",) parent_key = (b"parent-id",) record = [b"0,1,1\n", b"new-line\n"] details = ("line-delta", False) res = ann._expand_record(rev_key, (parent_key,), parent_key, record, details) self.assertEqual(None, res) self.assertIn(parent_key, ann._pending_deltas) pending = ann._pending_deltas[parent_key] self.assertEqual(1, len(pending)) self.assertEqual((rev_key, (parent_key,), record, details), pending[0]) def test__expand_record_tracks_num_children(self): ann = self.make_annotator() rev_key = (b"rev-id",) rev2_key = (b"rev2-id",) parent_key = (b"parent-id",) record = [b"0,1,1\n", b"new-line\n"] details = ("line-delta", False) ann._num_compression_children[parent_key] = 2 ann._expand_record( parent_key, (), None, [b"line1\n", b"line2\n"], ("fulltext", False) ) ann._expand_record(rev_key, (parent_key,), parent_key, record, details) self.assertEqual({parent_key: 1}, ann._num_compression_children) # Expanding the second child should remove the content object, and the # num_compression_children entry ann._expand_record(rev2_key, (parent_key,), parent_key, record, details) self.assertNotIn(parent_key, ann._content_objects) self.assertEqual({}, ann._num_compression_children) # We should not cache the content_objects for rev2 and rev, because # they do not have compression children of their own. 
self.assertEqual({}, ann._content_objects) def test__expand_delta_records_blocks(self): ann = self.make_annotator() rev_key = (b"rev-id",) parent_key = (b"parent-id",) record = [b"0,1,1\n", b"new-line\n"] details = ("line-delta", True) ann._num_compression_children[parent_key] = 2 ann._expand_record( parent_key, (), None, [b"line1\n", b"line2\n", b"line3\n"], ("fulltext", False), ) ann._expand_record(rev_key, (parent_key,), parent_key, record, details) self.assertEqual( {(rev_key, parent_key): [(1, 1, 1), (3, 3, 0)]}, ann._matching_blocks ) rev2_key = (b"rev2-id",) record = [b"0,1,1\n", b"new-line\n"] details = ("line-delta", False) ann._expand_record(rev2_key, (parent_key,), parent_key, record, details) self.assertEqual( [(1, 1, 2), (3, 3, 0)], ann._matching_blocks[(rev2_key, parent_key)] ) def test__get_parent_ann_uses_matching_blocks(self): ann = self.make_annotator() rev_key = (b"rev-id",) parent_key = (b"parent-id",) parent_ann = [(parent_key,)] * 3 block_key = (rev_key, parent_key) ann._annotations_cache[parent_key] = parent_ann ann._matching_blocks[block_key] = [(0, 1, 1), (3, 3, 0)] # We should not try to access any parent_lines content, because we know # we already have the matching blocks par_ann, blocks = ann._get_parent_annotations_and_matches( rev_key, [b"1\n", b"2\n", b"3\n"], parent_key ) self.assertEqual(parent_ann, par_ann) self.assertEqual([(0, 1, 1), (3, 3, 0)], blocks) self.assertEqual({}, ann._matching_blocks) def test__process_pending(self): ann = self.make_annotator() rev_key = (b"rev-id",) p1_key = (b"p1-id",) p2_key = (b"p2-id",) record = [b"0,1,1\n", b"new-line\n"] details = ("line-delta", False) p1_record = [b"line1\n", b"line2\n"] ann._num_compression_children[p1_key] = 1 res = ann._expand_record(rev_key, (p1_key, p2_key), p1_key, record, details) self.assertEqual(None, res) # self.assertTrue(p1_key in ann._pending_deltas) self.assertEqual({}, ann._pending_annotation) # Now insert p1, and we should be able to expand the delta res = 
ann._expand_record(p1_key, (), None, p1_record, ("fulltext", False)) self.assertEqual(p1_record, res) ann._annotations_cache[p1_key] = [(p1_key,)] * 2 res = ann._process_pending(p1_key) self.assertEqual([], res) self.assertNotIn(p1_key, ann._pending_deltas) self.assertIn(p2_key, ann._pending_annotation) self.assertEqual( {p2_key: [(rev_key, (p1_key, p2_key))]}, ann._pending_annotation ) # Now fill in parent 2, and pending annotation should be satisfied res = ann._expand_record(p2_key, (), None, [], ("fulltext", False)) ann._annotations_cache[p2_key] = [] res = ann._process_pending(p2_key) self.assertEqual([rev_key], res) self.assertEqual({}, ann._pending_annotation) self.assertEqual({}, ann._pending_deltas) def test_record_delta_removes_basis(self): ann = self.make_annotator() ann._expand_record( (b"parent-id",), (), None, [b"line1\n", b"line2\n"], ("fulltext", False) ) ann._num_compression_children[b"parent-id"] = 2 def test_annotate_special_text(self): ann = self.make_annotator() vf = ann._vf rev1_key = (b"rev-1",) rev2_key = (b"rev-2",) rev3_key = (b"rev-3",) spec_key = (b"special:",) vf.add_lines(rev1_key, [], [b"initial content\n"]) vf.add_lines( rev2_key, [rev1_key], [b"initial content\n", b"common content\n", b"content in 2\n"], ) vf.add_lines( rev3_key, [rev1_key], [b"initial content\n", b"common content\n", b"content in 3\n"], ) spec_text = b"initial content\ncommon content\ncontent in 2\ncontent in 3\n" ann.add_special_text(spec_key, [rev2_key, rev3_key], spec_text) anns, lines = ann.annotate(spec_key) self.assertEqual( [ (rev1_key,), (rev2_key, rev3_key), (rev2_key,), (rev3_key,), ], anns, ) self.assertEqualDiff(spec_text, b"".join(lines)) class KnitTests(TestCaseWithMemoryTransport): """Class containing knit test helper routines.""" def make_test_knit(self, annotate=False, name="test"): mapper = ConstantMapper(name) return make_file_factory(annotate, mapper)(self.get_transport()) class TestBadShaError(KnitTests): """Tests for handling of sha errors.""" 
def test_sha_exception_has_text(self): # having the failed text included in the error allows for recovery. source = self.make_test_knit() target = self.make_test_knit(name="target") if not source._max_delta_chain: raise TestNotApplicable( "cannot get delta-caused sha failures without deltas." ) # create a basis basis = (b"basis",) broken = (b"broken",) source.add_lines(basis, (), [b"foo\n"]) source.add_lines(broken, (basis,), [b"foo\n", b"bar\n"]) # Seed target with a bad basis text target.add_lines(basis, (), [b"gam\n"]) target.insert_record_stream( source.get_record_stream([broken], "unordered", False) ) err = self.assertRaises( KnitCorrupt, next(target.get_record_stream([broken], "unordered", True)).get_bytes_as, "chunked", ) self.assertEqual([b"gam\n", b"bar\n"], err.content) # Test for formatting with live data self.assertStartsWith(str(err), "Knit ") class TestKnitIndex(KnitTests): def test_add_versions_dictionary_compresses(self): """Adding versions to the index should update the lookup dict.""" knit = self.make_test_knit() idx = knit._index idx.add_records([((b"a-1",), [b"fulltext"], ((b"a-1",), 0, 0), [])]) self.check_file_contents( "test.kndx", b"# bzr knit index 8\n\na-1 fulltext 0 0 :" ) idx.add_records( [ ((b"a-2",), [b"fulltext"], ((b"a-2",), 0, 0), [(b"a-1",)]), ((b"a-3",), [b"fulltext"], ((b"a-3",), 0, 0), [(b"a-2",)]), ] ) self.check_file_contents( "test.kndx", b"# bzr knit index 8\n" b"\n" b"a-1 fulltext 0 0 :\n" b"a-2 fulltext 0 0 0 :\n" b"a-3 fulltext 0 0 1 :", ) self.assertEqual({(b"a-3",), (b"a-1",), (b"a-2",)}, idx.keys()) self.assertEqual( { (b"a-1",): (((b"a-1",), 0, 0), None, (), ("fulltext", False)), (b"a-2",): (((b"a-2",), 0, 0), None, ((b"a-1",),), ("fulltext", False)), (b"a-3",): (((b"a-3",), 0, 0), None, ((b"a-2",),), ("fulltext", False)), }, idx.get_build_details(idx.keys()), ) self.assertEqual( { (b"a-1",): (), (b"a-2",): ((b"a-1",),), (b"a-3",): ((b"a-2",),), }, idx.get_parent_map(idx.keys()), ) def 
test_add_versions_fails_clean(self): """If add_versions fails in the middle, it restores a pristine state. Any modifications that are made to the index are reset if all versions cannot be added. """ # This cheats a little bit by passing in a generator which will # raise an exception before the processing finishes # Other possibilities would be to have a version with the wrong number # of entries, or to make the backing transport unable to write any # files. knit = self.make_test_knit() idx = knit._index idx.add_records([((b"a-1",), [b"fulltext"], ((b"a-1",), 0, 0), [])]) class StopEarly(Exception): pass def generate_failure(): """Add some entries and then raise an exception.""" yield ((b"a-2",), [b"fulltext"], (None, 0, 0), (b"a-1",)) yield ((b"a-3",), [b"fulltext"], (None, 0, 0), (b"a-2",)) raise StopEarly() # Assert the pre-condition def assertA1Only(): self.assertEqual({(b"a-1",)}, set(idx.keys())) self.assertEqual( {(b"a-1",): (((b"a-1",), 0, 0), None, (), ("fulltext", False))}, idx.get_build_details([(b"a-1",)]), ) self.assertEqual({(b"a-1",): ()}, idx.get_parent_map(idx.keys())) assertA1Only() self.assertRaises(StopEarly, idx.add_records, generate_failure()) # And it shouldn't be modified assertA1Only() def test_knit_index_ignores_empty_files(self): # There was a race condition in older bzr, where a ^C at the right time # could leave an empty .kndx file, which bzr would later claim was a # corrupted file since the header was not present. In reality, the file # just wasn't created, so it should be ignored.
t = self.get_transport() t.put_bytes("test.kndx", b"") self.make_test_knit() def test_knit_index_checks_header(self): t = self.get_transport() t.put_bytes("test.kndx", b"# not really a knit header\n\n") k = self.make_test_knit() self.assertRaises(KnitHeaderError, k.keys) class TestGraphIndexKnit(KnitTests): """Tests for knits using a GraphIndex rather than a KnitIndex.""" def make_g_index(self, name, ref_lists=0, nodes=None): if nodes is None: nodes = [] builder = GraphIndexBuilder(ref_lists) for node, references, value in nodes: builder.add_node(node, references, value) stream = builder.finish() trans = self.get_transport() size = trans.put_file(name, stream) return GraphIndex(trans, name, size) def two_graph_index(self, deltas=False, catch_adds=False): """Build a two-graph index. :param deltas: If true, use underlying indices with two node-ref lists and 'parent' set to be delta-compressed against tail. """ # build a complex graph across several indices. if deltas: # delta compression in the index index1 = self.make_g_index( "1", 2, [ ( (b"tip",), b"N0 100", ( [(b"parent",)], [], ), ), ((b"tail",), b"", ([], [])), ], ) index2 = self.make_g_index( "2", 2, [ ( (b"parent",), b" 100 78", ([(b"tail",), (b"ghost",)], [(b"tail",)]), ), ((b"separate",), b"", ([], [])), ], ) else: # just blob location and graph in the index.
index1 = self.make_g_index( "1", 1, [((b"tip",), b"N0 100", ([(b"parent",)],)), ((b"tail",), b"", ([],))], ) index2 = self.make_g_index( "2", 1, [ ((b"parent",), b" 100 78", ([(b"tail",), (b"ghost",)],)), ((b"separate",), b"", ([],)), ], ) combined_index = CombinedGraphIndex([index1, index2]) if catch_adds: self.combined_index = combined_index self.caught_entries = [] add_callback = self.catch_add else: add_callback = None return _KnitGraphIndex( combined_index, lambda: True, deltas=deltas, add_callback=add_callback ) def test_keys(self): index = self.two_graph_index() self.assertEqual( {(b"tail",), (b"tip",), (b"parent",), (b"separate",)}, set(index.keys()) ) def test_get_position(self): index = self.two_graph_index() self.assertEqual( (index._graph_index._indices[0], 0, 100), index.get_position((b"tip",)) ) self.assertEqual( (index._graph_index._indices[1], 100, 78), index.get_position((b"parent",)) ) def test_get_method_deltas(self): index = self.two_graph_index(deltas=True) self.assertEqual("fulltext", index.get_method((b"tip",))) self.assertEqual("line-delta", index.get_method((b"parent",))) def test_get_method_no_deltas(self): # check that the parent-history lookup is ignored with deltas=False. index = self.two_graph_index(deltas=False) self.assertEqual("fulltext", index.get_method((b"tip",))) self.assertEqual("fulltext", index.get_method((b"parent",))) def test_get_options_deltas(self): index = self.two_graph_index(deltas=True) self.assertEqual([b"fulltext", b"no-eol"], index.get_options((b"tip",))) self.assertEqual([b"line-delta"], index.get_options((b"parent",))) def test_get_options_no_deltas(self): # check that the parent-history lookup is ignored with deltas=False. 
index = self.two_graph_index(deltas=False) self.assertEqual([b"fulltext", b"no-eol"], index.get_options((b"tip",))) self.assertEqual([b"fulltext"], index.get_options((b"parent",))) def test_get_parent_map(self): index = self.two_graph_index() self.assertEqual( {(b"parent",): ((b"tail",), (b"ghost",))}, index.get_parent_map([(b"parent",), (b"ghost",)]), ) def catch_add(self, entries): self.caught_entries.append(entries) def test_add_no_callback_errors(self): index = self.two_graph_index() self.assertRaises( ReadOnlyError, index.add_records, [((b"new",), b"fulltext,no-eol", (None, 50, 60), [b"separate"])], ) def test_add_version_smoke(self): index = self.two_graph_index(catch_adds=True) index.add_records( [((b"new",), b"fulltext,no-eol", (None, 50, 60), [(b"separate",)])] ) self.assertEqual( [[((b"new",), b"N50 60", (((b"separate",),),))]], self.caught_entries ) def test_add_version_delta_not_delta_index(self): index = self.two_graph_index(catch_adds=True) self.assertRaises( KnitCorrupt, index.add_records, [((b"new",), b"no-eol,line-delta", (None, 0, 100), [(b"parent",)])], ) self.assertEqual([], self.caught_entries) def test_add_version_same_dup(self): index = self.two_graph_index(catch_adds=True) # options can be spelt two different ways index.add_records( [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [(b"parent",)])] ) index.add_records( [((b"tip",), b"no-eol,fulltext", (None, 0, 100), [(b"parent",)])] ) # position/length are ignored (because each pack could have fulltext or # delta, and be at a different position. 
        index.add_records(
            [((b"tip",), b"fulltext,no-eol", (None, 50, 100), [(b"parent",)])]
        )
        index.add_records(
            [((b"tip",), b"fulltext,no-eol", (None, 0, 1000), [(b"parent",)])]
        )
        # but neither should have added data:
        self.assertEqual([[], [], [], []], self.caught_entries)

    def test_add_version_different_dup(self):
        index = self.two_graph_index(deltas=True, catch_adds=True)
        # change options
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"line-delta", (None, 0, 100), [(b"parent",)])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext", (None, 0, 100), [(b"parent",)])],
        )
        # parents
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_versions_nodeltas(self):
        index = self.two_graph_index(catch_adds=True)
        index.add_records(
            [
                ((b"new",), b"fulltext,no-eol", (None, 50, 60), [(b"separate",)]),
                ((b"new2",), b"fulltext", (None, 0, 6), [(b"new",)]),
            ]
        )
        self.assertEqual(
            [
                ((b"new",), b"N50 60", (((b"separate",),),)),
                ((b"new2",), b" 0 6", (((b"new",),),)),
            ],
            sorted(self.caught_entries[0]),
        )
        self.assertEqual(1, len(self.caught_entries))

    def test_add_versions_deltas(self):
        index = self.two_graph_index(deltas=True, catch_adds=True)
        index.add_records(
            [
                ((b"new",), b"fulltext,no-eol", (None, 50, 60), [(b"separate",)]),
                ((b"new2",), b"line-delta", (None, 0, 6), [(b"new",)]),
            ]
        )
        self.assertEqual(
            [
                ((b"new",), b"N50 60", (((b"separate",),), ())),
                (
                    (b"new2",),
                    b" 0 6",
                    (
                        ((b"new",),),
                        ((b"new",),),
                    ),
                ),
            ],
            sorted(self.caught_entries[0]),
        )
        self.assertEqual(1, len(self.caught_entries))

    def test_add_versions_delta_not_delta_index(self):
        index = self.two_graph_index(catch_adds=True)
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"new",), b"no-eol,line-delta", (None, 0, 100), [(b"parent",)])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_versions_random_id_accepted(self):
        index = self.two_graph_index(catch_adds=True)
        index.add_records([], random_id=True)

    def test_add_versions_same_dup(self):
        index = self.two_graph_index(catch_adds=True)
        # options can be spelt two different ways
        index.add_records(
            [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [(b"parent",)])]
        )
        index.add_records(
            [((b"tip",), b"no-eol,fulltext", (None, 0, 100), [(b"parent",)])]
        )
        # position/length are ignored (because each pack could have fulltext or
        # delta, and be at a different position).
        index.add_records(
            [((b"tip",), b"fulltext,no-eol", (None, 50, 100), [(b"parent",)])]
        )
        index.add_records(
            [((b"tip",), b"fulltext,no-eol", (None, 0, 1000), [(b"parent",)])]
        )
        # but neither should have added data.
        self.assertEqual([[], [], [], []], self.caught_entries)

    def test_add_versions_different_dup(self):
        index = self.two_graph_index(deltas=True, catch_adds=True)
        # change options
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"line-delta", (None, 0, 100), [(b"parent",)])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext", (None, 0, 100), [(b"parent",)])],
        )
        # parents
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [])],
        )
        # change options in the second record
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [
                ((b"tip",), b"fulltext,no-eol", (None, 0, 100), [(b"parent",)]),
                ((b"tip",), b"line-delta", (None, 0, 100), [(b"parent",)]),
            ],
        )
        self.assertEqual([], self.caught_entries)

    def make_g_index_missing_compression_parent(self):
        graph_index = self.make_g_index(
            "missing_comp",
            2,
            [
                (
                    (b"tip",),
                    b" 100 78",
                    ([(b"missing-parent",), (b"ghost",)], [(b"missing-parent",)]),
                )
            ],
        )
        return graph_index

    def make_g_index_missing_parent(self):
        graph_index = self.make_g_index(
            "missing_parent",
            2,
            [
                ((b"parent",), b" 100 78", ([], [])),
                (
                    (b"tip",),
                    b" 100 78",
                    ([(b"parent",), (b"missing-parent",)], [(b"parent",)]),
                ),
            ],
        )
        return graph_index

    def make_g_index_no_external_refs(self):
        graph_index = self.make_g_index(
            "no_external_refs",
            2,
            [((b"rev",), b" 100 78", ([(b"parent",), (b"ghost",)], []))],
        )
        return graph_index

    def test_add_good_unvalidated_index(self):
        unvalidated = self.make_g_index_no_external_refs()
        combined = CombinedGraphIndex([unvalidated])
        index = _KnitGraphIndex(combined, lambda: True, deltas=True)
        index.scan_unvalidated_index(unvalidated)
        self.assertEqual(frozenset(), index.get_missing_compression_parents())

    def test_add_missing_compression_parent_unvalidated_index(self):
        unvalidated = self.make_g_index_missing_compression_parent()
        combined = CombinedGraphIndex([unvalidated])
        index = _KnitGraphIndex(combined, lambda: True, deltas=True)
        index.scan_unvalidated_index(unvalidated)
        # This also checks that it's only the compression parent that is
        # examined, otherwise 'ghost' would also be reported as a missing
        # parent.
        self.assertEqual(
            frozenset([(b"missing-parent",)]), index.get_missing_compression_parents()
        )

    def test_add_missing_noncompression_parent_unvalidated_index(self):
        unvalidated = self.make_g_index_missing_parent()
        combined = CombinedGraphIndex([unvalidated])
        index = _KnitGraphIndex(
            combined, lambda: True, deltas=True, track_external_parent_refs=True
        )
        index.scan_unvalidated_index(unvalidated)
        self.assertEqual(frozenset([(b"missing-parent",)]), index.get_missing_parents())

    def test_track_external_parent_refs(self):
        g_index = self.make_g_index("empty", 2, [])
        combined = CombinedGraphIndex([g_index])
        index = _KnitGraphIndex(
            combined,
            lambda: True,
            deltas=True,
            add_callback=self.catch_add,
            track_external_parent_refs=True,
        )
        self.caught_entries = []
        index.add_records(
            [
                (
                    (b"new-key",),
                    b"fulltext,no-eol",
                    (None, 50, 60),
                    [(b"parent-1",), (b"parent-2",)],
                )
            ]
        )
        self.assertEqual(
            frozenset([(b"parent-1",), (b"parent-2",)]), index.get_missing_parents()
        )

    def test_add_unvalidated_index_with_present_external_references(self):
        index = self.two_graph_index(deltas=True)
        # Ugly hack to get at one of the underlying GraphIndex objects that
        # two_graph_index built.
        unvalidated = index._graph_index._indices[1]
        # 'parent' is an external ref of _indices[1] (unvalidated), but is
        # present in _indices[0].
        index.scan_unvalidated_index(unvalidated)
        self.assertEqual(frozenset(), index.get_missing_compression_parents())

    def make_new_missing_parent_g_index(self, name):
        missing_parent = name.encode("ascii") + b"-missing-parent"
        graph_index = self.make_g_index(
            name,
            2,
            [
                (
                    (name.encode("ascii") + b"tip",),
                    b" 100 78",
                    ([(missing_parent,), (b"ghost",)], [(missing_parent,)]),
                )
            ],
        )
        return graph_index

    def test_add_mulitiple_unvalidated_indices_with_missing_parents(self):
        g_index_1 = self.make_new_missing_parent_g_index("one")
        g_index_2 = self.make_new_missing_parent_g_index("two")
        combined = CombinedGraphIndex([g_index_1, g_index_2])
        index = _KnitGraphIndex(combined, lambda: True, deltas=True)
        index.scan_unvalidated_index(g_index_1)
        index.scan_unvalidated_index(g_index_2)
        self.assertEqual(
            frozenset([(b"one-missing-parent",), (b"two-missing-parent",)]),
            index.get_missing_compression_parents(),
        )

    def test_add_mulitiple_unvalidated_indices_with_mutual_dependencies(self):
        graph_index_a = self.make_g_index(
            "one",
            2,
            [
                ((b"parent-one",), b" 100 78", ([(b"non-compression-parent",)], [])),
                (
                    (b"child-of-two",),
                    b" 100 78",
                    ([(b"parent-two",)], [(b"parent-two",)]),
                ),
            ],
        )
        graph_index_b = self.make_g_index(
            "two",
            2,
            [
                ((b"parent-two",), b" 100 78", ([(b"non-compression-parent",)], [])),
                (
                    (b"child-of-one",),
                    b" 100 78",
                    ([(b"parent-one",)], [(b"parent-one",)]),
                ),
            ],
        )
        combined = CombinedGraphIndex([graph_index_a, graph_index_b])
        index = _KnitGraphIndex(combined, lambda: True, deltas=True)
        index.scan_unvalidated_index(graph_index_a)
        index.scan_unvalidated_index(graph_index_b)
        self.assertEqual(frozenset([]), index.get_missing_compression_parents())


class TestNoParentsGraphIndexKnit(KnitTests):
    """Tests for knits using _KnitGraphIndex with no parents."""

    def make_g_index(self, name, ref_lists=0, nodes=None):
        if nodes is None:
            nodes = []
        builder = GraphIndexBuilder(ref_lists)
        for node, references in nodes:
            builder.add_node(node, references)
        stream = builder.finish()
        trans = self.get_transport()
        size = trans.put_file(name, stream)
        return GraphIndex(trans, name, size)

    def test_add_good_unvalidated_index(self):
        unvalidated = self.make_g_index("unvalidated")
        combined = CombinedGraphIndex([unvalidated])
        index = _KnitGraphIndex(combined, lambda: True, parents=False)
        index.scan_unvalidated_index(unvalidated)
        self.assertEqual(frozenset(), index.get_missing_compression_parents())

    def test_parents_deltas_incompatible(self):
        index = CombinedGraphIndex([])
        self.assertRaises(
            knit.KnitError,
            _KnitGraphIndex,
            lambda: True,
            index,
            deltas=True,
            parents=False,
        )

    def two_graph_index(self, catch_adds=False):
        """Build a two-graph index without parent references.

        :param catch_adds: If true, record index additions in
            self.caught_entries via self.catch_add.
        """
        # put several versions in the index.
        index1 = self.make_g_index("1", 0, [((b"tip",), b"N0 100"), ((b"tail",), b"")])
        index2 = self.make_g_index(
            "2", 0, [((b"parent",), b" 100 78"), ((b"separate",), b"")]
        )
        combined_index = CombinedGraphIndex([index1, index2])
        if catch_adds:
            self.combined_index = combined_index
            self.caught_entries = []
            add_callback = self.catch_add
        else:
            add_callback = None
        return _KnitGraphIndex(
            combined_index, lambda: True, parents=False, add_callback=add_callback
        )

    def test_keys(self):
        index = self.two_graph_index()
        self.assertEqual(
            {(b"tail",), (b"tip",), (b"parent",), (b"separate",)}, set(index.keys())
        )

    def test_get_position(self):
        index = self.two_graph_index()
        self.assertEqual(
            (index._graph_index._indices[0], 0, 100), index.get_position((b"tip",))
        )
        self.assertEqual(
            (index._graph_index._indices[1], 100, 78), index.get_position((b"parent",))
        )

    def test_get_method(self):
        index = self.two_graph_index()
        self.assertEqual("fulltext", index.get_method((b"tip",)))
        self.assertEqual([b"fulltext"],
            index.get_options((b"parent",)))

    def test_get_options(self):
        index = self.two_graph_index()
        self.assertEqual([b"fulltext", b"no-eol"], index.get_options((b"tip",)))
        self.assertEqual([b"fulltext"], index.get_options((b"parent",)))

    def test_get_parent_map(self):
        index = self.two_graph_index()
        self.assertEqual(
            {(b"parent",): None}, index.get_parent_map([(b"parent",), (b"ghost",)])
        )

    def catch_add(self, entries):
        self.caught_entries.append(entries)

    def test_add_no_callback_errors(self):
        index = self.two_graph_index()
        self.assertRaises(
            ReadOnlyError,
            index.add_records,
            [((b"new",), b"fulltext,no-eol", (None, 50, 60), [(b"separate",)])],
        )

    def test_add_version_smoke(self):
        index = self.two_graph_index(catch_adds=True)
        index.add_records([((b"new",), b"fulltext,no-eol", (None, 50, 60), [])])
        self.assertEqual([[((b"new",), b"N50 60")]], self.caught_entries)

    def test_add_version_delta_not_delta_index(self):
        index = self.two_graph_index(catch_adds=True)
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"new",), b"no-eol,line-delta", (None, 0, 100), [])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_version_same_dup(self):
        index = self.two_graph_index(catch_adds=True)
        # options can be spelt two different ways
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 0, 100), [])])
        index.add_records([((b"tip",), b"no-eol,fulltext", (None, 0, 100), [])])
        # position/length are ignored (because each pack could have fulltext or
        # delta, and be at a different position).
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 50, 100), [])])
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 0, 1000), [])])
        # but neither should have added data.
        self.assertEqual([[], [], [], []], self.caught_entries)

    def test_add_version_different_dup(self):
        index = self.two_graph_index(catch_adds=True)
        # change options
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"no-eol,line-delta", (None, 0, 100), [])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"line-delta,no-eol", (None, 0, 100), [])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext", (None, 0, 100), [])],
        )
        # parents
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [(b"parent",)])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_versions(self):
        index = self.two_graph_index(catch_adds=True)
        index.add_records(
            [
                ((b"new",), b"fulltext,no-eol", (None, 50, 60), []),
                ((b"new2",), b"fulltext", (None, 0, 6), []),
            ]
        )
        self.assertEqual(
            [((b"new",), b"N50 60"), ((b"new2",), b" 0 6")],
            sorted(self.caught_entries[0]),
        )
        self.assertEqual(1, len(self.caught_entries))

    def test_add_versions_delta_not_delta_index(self):
        index = self.two_graph_index(catch_adds=True)
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"new",), b"no-eol,line-delta", (None, 0, 100), [(b"parent",)])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_versions_parents_not_parents_index(self):
        index = self.two_graph_index(catch_adds=True)
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"new",), b"no-eol,fulltext", (None, 0, 100), [(b"parent",)])],
        )
        self.assertEqual([], self.caught_entries)

    def test_add_versions_random_id_accepted(self):
        index = self.two_graph_index(catch_adds=True)
        index.add_records([], random_id=True)

    def test_add_versions_same_dup(self):
        index = self.two_graph_index(catch_adds=True)
        # options can be spelt two different ways
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 0, 100), [])])
        index.add_records([((b"tip",), b"no-eol,fulltext", (None, 0, 100), [])])
        # position/length are ignored (because each pack could have fulltext or
        # delta, and be at a different position).
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 50, 100), [])])
        index.add_records([((b"tip",), b"fulltext,no-eol", (None, 0, 1000), [])])
        # but neither should have added data.
        self.assertEqual([[], [], [], []], self.caught_entries)

    def test_add_versions_different_dup(self):
        index = self.two_graph_index(catch_adds=True)
        # change options
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"no-eol,line-delta", (None, 0, 100), [])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"line-delta,no-eol", (None, 0, 100), [])],
        )
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext", (None, 0, 100), [])],
        )
        # parents
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [((b"tip",), b"fulltext,no-eol", (None, 0, 100), [(b"parent",)])],
        )
        # change options in the second record
        self.assertRaises(
            KnitCorrupt,
            index.add_records,
            [
                ((b"tip",), b"fulltext,no-eol", (None, 0, 100), []),
                ((b"tip",), b"no-eol,line-delta", (None, 0, 100), []),
            ],
        )
        self.assertEqual([], self.caught_entries)


class TestKnitVersionedFiles(KnitTests):
    def assertGroupKeysForIo(
        self, exp_groups, keys, non_local_keys, positions, _min_buffer_size=None
    ):
        kvf = self.make_test_knit()
        if _min_buffer_size is None:
            _min_buffer_size = knit._STREAM_MIN_BUFFER_SIZE
        self.assertEqual(
            exp_groups,
            kvf._group_keys_for_io(
                keys, non_local_keys, positions, _min_buffer_size=_min_buffer_size
            ),
        )

    def assertSplitByPrefix(self, expected_map, expected_prefix_order, keys):
        split, prefix_order = KnitVersionedFiles._split_by_prefix(keys)
        self.assertEqual(expected_map, split)
        self.assertEqual(expected_prefix_order, prefix_order)

    def test__group_keys_for_io(self):
        ft_detail = ("fulltext", False)
        ld_detail = ("line-delta", False)
        f_a = (b"f", b"a")
        f_b = (b"f", b"b")
        f_c = (b"f", b"c")
        g_a = (b"g", b"a")
        g_b = (b"g", b"b")
        g_c = (b"g", b"c")
        positions = {
            f_a: (ft_detail, (f_a, 0, 100), None),
            f_b: (ld_detail, (f_b, 100, 21), f_a),
            f_c: (ld_detail, (f_c, 180, 15), f_b),
            g_a: (ft_detail, (g_a, 121, 35), None),
            g_b: (ld_detail, (g_b, 156, 12), g_a),
            g_c: (ld_detail, (g_c, 195, 13), g_a),
        }
        self.assertGroupKeysForIo([([f_a], set())], [f_a], [], positions)
        self.assertGroupKeysForIo([([f_a], {f_a})], [f_a], [f_a], positions)
        self.assertGroupKeysForIo([([f_a, f_b], set())], [f_a, f_b], [], positions)
        self.assertGroupKeysForIo([([f_a, f_b], {f_b})], [f_a, f_b], [f_b], positions)
        self.assertGroupKeysForIo(
            [([f_a, f_b, g_a, g_b], set())], [f_a, g_a, f_b, g_b], [], positions
        )
        self.assertGroupKeysForIo(
            [([f_a, f_b, g_a, g_b], set())],
            [f_a, g_a, f_b, g_b],
            [],
            positions,
            _min_buffer_size=150,
        )
        self.assertGroupKeysForIo(
            [([f_a, f_b], set()), ([g_a, g_b], set())],
            [f_a, g_a, f_b, g_b],
            [],
            positions,
            _min_buffer_size=100,
        )
        self.assertGroupKeysForIo(
            [([f_c], set()), ([g_b], set())],
            [f_c, g_b],
            [],
            positions,
            _min_buffer_size=125,
        )
        self.assertGroupKeysForIo(
            [([g_b, f_c], set())], [g_b, f_c], [], positions, _min_buffer_size=125
        )

    def test__split_by_prefix(self):
        self.assertSplitByPrefix(
            {
                b"f": [(b"f", b"a"), (b"f", b"b")],
                b"g": [(b"g", b"b"), (b"g", b"a")],
            },
            [b"f", b"g"],
            [(b"f", b"a"), (b"g", b"b"), (b"g", b"a"), (b"f", b"b")],
        )

        self.assertSplitByPrefix(
            {
                b"f": [(b"f", b"a"), (b"f", b"b")],
                b"g": [(b"g", b"b"), (b"g", b"a")],
            },
            [b"f", b"g"],
            [(b"f", b"a"), (b"f", b"b"), (b"g", b"b"), (b"g", b"a")],
        )

        self.assertSplitByPrefix(
            {
                b"f": [(b"f", b"a"), (b"f", b"b")],
                b"g": [(b"g", b"b"), (b"g", b"a")],
            },
            [b"f", b"g"],
            [(b"f", b"a"), (b"f", b"b"), (b"g", b"b"), (b"g", b"a")],
        )

        self.assertSplitByPrefix(
            {
                b"f": [(b"f", b"a"), (b"f", b"b")],
                b"g": [(b"g", b"b"), (b"g", b"a")],
                b"": [(b"a",), (b"b",)],
            },
            [b"f", b"g", b""],
            [(b"f", b"a"), (b"g", b"b"), (b"a",), (b"b",), (b"g", b"a"), (b"f", b"b")],
        )


class TestStacking(KnitTests):
    def get_basis_and_test_knit(self):
        basis = self.make_test_knit(name="basis")
        basis = RecordingVersionedFilesDecorator(basis)
        test = self.make_test_knit(name="test")
        test.add_fallback_versioned_files(basis)
        return basis, test

    def test_add_fallback_versioned_files(self):
        basis = self.make_test_knit(name="basis")
        test = self.make_test_knit(name="test")
        # It must not error; other tests test that the fallback is referred to
        # when accessing data.
        test.add_fallback_versioned_files(basis)

    def test_add_lines(self):
        # lines added to the test are not added to the basis
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_cross_border = (b"quux",)
        key_delta = (b"zaphod",)
        test.add_lines(key, (), [b"foo\n"])
        self.assertEqual({}, basis.get_parent_map([key]))
        # lines added to the test that reference across the stack do a
        # fulltext.
        basis.add_lines(key_basis, (), [b"foo\n"])
        basis.calls = []
        test.add_lines(key_cross_border, (key_basis,), [b"foo\n"])
        self.assertEqual("fulltext", test._index.get_method(key_cross_border))
        # we don't even need to look at the basis to see that this should be
        # stored as a fulltext
        self.assertEqual([], basis.calls)
        # Subsequent adds do delta.
        basis.calls = []
        test.add_lines(key_delta, (key_cross_border,), [b"foo\n"])
        self.assertEqual("line-delta", test._index.get_method(key_delta))
        self.assertEqual([], basis.calls)

    def test_annotate(self):
        # annotations from the test knit are answered without asking the basis
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        test.add_lines(key, (), [b"foo\n"])
        details = test.annotate(key)
        self.assertEqual([(key, b"foo\n")], details)
        self.assertEqual([], basis.calls)
        # But texts that are not in the test knit are looked for in the basis
        # directly.
        basis.add_lines(key_basis, (), [b"foo\n", b"bar\n"])
        basis.calls = []
        details = test.annotate(key_basis)
        self.assertEqual([(key_basis, b"foo\n"), (key_basis, b"bar\n")], details)
        # Not optimised to date:
        # self.assertEqual([("annotate", key_basis)], basis.calls)
        self.assertEqual(
            [
                ("get_parent_map", {key_basis}),
                ("get_parent_map", {key_basis}),
                ("get_record_stream", [key_basis], "topological", True),
            ],
            basis.calls,
        )

    def test_check(self):
        # At the moment checking a stacked knit does implicitly check the
        # fallback files.
        _basis, test = self.get_basis_and_test_knit()
        test.check()

    def test_get_parent_map(self):
        # parents in the test knit are answered without asking the basis
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_missing = (b"missing",)
        test.add_lines(key, (), [])
        parent_map = test.get_parent_map([key])
        self.assertEqual({key: ()}, parent_map)
        self.assertEqual([], basis.calls)
        # But parents that are not in the test knit are looked for in the basis
        basis.add_lines(key_basis, (), [])
        basis.calls = []
        parent_map = test.get_parent_map([key, key_basis, key_missing])
        self.assertEqual({key: (), key_basis: ()}, parent_map)
        self.assertEqual([("get_parent_map", {key_basis, key_missing})], basis.calls)

    def test_get_record_stream_unordered_fulltexts(self):
        # records from the test knit are answered without asking the basis:
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_missing = (b"missing",)
        test.add_lines(key, (), [b"foo\n"])
        records = list(test.get_record_stream([key], "unordered", True))
        self.assertEqual(1, len(records))
        self.assertEqual([], basis.calls)
        # Missing (from test knit) objects are retrieved from the basis:
        basis.add_lines(key_basis, (), [b"foo\n", b"bar\n"])
        basis.calls = []
        records = list(
            test.get_record_stream([key_basis, key_missing], "unordered", True)
        )
        self.assertEqual(2, len(records))
        calls = list(basis.calls)
        for record in records:
            self.assertSubset([record.key],
                (key_basis, key_missing))
            if record.key == key_missing:
                self.assertIsInstance(record, AbsentContentFactory)
            else:
                reference = list(
                    basis.get_record_stream([key_basis], "unordered", True)
                )[0]
                self.assertEqual(reference.key, record.key)
                self.assertEqual(reference.sha1, record.sha1)
                self.assertEqual(reference.storage_kind, record.storage_kind)
                self.assertEqual(
                    reference.get_bytes_as(reference.storage_kind),
                    record.get_bytes_as(record.storage_kind),
                )
                self.assertEqual(
                    reference.get_bytes_as("fulltext"), record.get_bytes_as("fulltext")
                )
        # It's not strictly minimal, but it seems reasonable for now for it to
        # ask which fallbacks have which parents.
        self.assertEqual(
            [
                ("get_parent_map", {key_basis, key_missing}),
                ("get_record_stream", [key_basis], "unordered", True),
            ],
            calls,
        )

    def test_get_record_stream_ordered_fulltexts(self):
        # ordering is preserved down into the fallback store.
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_basis_2 = (b"quux",)
        key_missing = (b"missing",)
        test.add_lines(key, (key_basis,), [b"foo\n"])
        # Missing (from test knit) objects are retrieved from the basis:
        basis.add_lines(key_basis, (key_basis_2,), [b"foo\n", b"bar\n"])
        basis.add_lines(key_basis_2, (), [b"quux\n"])
        basis.calls = []
        # ask for in non-topological order
        records = list(
            test.get_record_stream(
                [key, key_basis, key_missing, key_basis_2], "topological", True
            )
        )
        self.assertEqual(4, len(records))
        results = []
        for record in records:
            self.assertSubset([record.key], (key_basis, key_missing, key_basis_2, key))
            if record.key == key_missing:
                self.assertIsInstance(record, AbsentContentFactory)
            else:
                results.append(
                    (
                        record.key,
                        record.sha1,
                        record.storage_kind,
                        record.get_bytes_as("fulltext"),
                    )
                )
        calls = list(basis.calls)
        order = [record[0] for record in results]
        self.assertEqual([key_basis_2, key_basis, key], order)
        for result in results:
            source = test if result[0] == key else basis
            record = next(source.get_record_stream([result[0]], "unordered", True))
            self.assertEqual(record.key, result[0])
            self.assertEqual(record.sha1, result[1])
            # We used to check that the storage kind matched, but actually it
            # depends on whether it was sourced from the basis, or in a single
            # group, because asking for full texts returns proxy objects to a
            # _ContentMapGenerator object; so checking the kind is unneeded.
            self.assertEqual(record.get_bytes_as("fulltext"), result[3])
        # It's not strictly minimal, but it seems reasonable for now for it to
        # ask which fallbacks have which parents.
        self.assertEqual(2, len(calls))
        self.assertEqual(
            ("get_parent_map", {key_basis, key_basis_2, key_missing}), calls[0]
        )
        # topological is requested from the fallback, because that is what
        # was requested at the top level.
        self.assertIn(
            calls[1],
            [
                ("get_record_stream", [key_basis_2, key_basis], "topological", True),
                ("get_record_stream", [key_basis, key_basis_2], "topological", True),
            ],
        )

    def test_get_record_stream_unordered_deltas(self):
        # records from the test knit are answered without asking the basis:
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_missing = (b"missing",)
        test.add_lines(key, (), [b"foo\n"])
        records = list(test.get_record_stream([key], "unordered", False))
        self.assertEqual(1, len(records))
        self.assertEqual([], basis.calls)
        # Missing (from test knit) objects are retrieved from the basis:
        basis.add_lines(key_basis, (), [b"foo\n", b"bar\n"])
        basis.calls = []
        records = list(
            test.get_record_stream([key_basis, key_missing], "unordered", False)
        )
        self.assertEqual(2, len(records))
        calls = list(basis.calls)
        for record in records:
            self.assertSubset([record.key], (key_basis, key_missing))
            if record.key == key_missing:
                self.assertIsInstance(record, AbsentContentFactory)
            else:
                reference = list(
                    basis.get_record_stream([key_basis], "unordered", False)
                )[0]
                self.assertEqual(reference.key, record.key)
                self.assertEqual(reference.sha1, record.sha1)
                self.assertEqual(reference.storage_kind,
                    record.storage_kind)
                self.assertEqual(
                    reference.get_bytes_as(reference.storage_kind),
                    record.get_bytes_as(record.storage_kind),
                )
        # It's not strictly minimal, but it seems reasonable for now for it to
        # ask which fallbacks have which parents.
        self.assertEqual(
            [
                ("get_parent_map", {key_basis, key_missing}),
                ("get_record_stream", [key_basis], "unordered", False),
            ],
            calls,
        )

    def test_get_record_stream_ordered_deltas(self):
        # ordering is preserved down into the fallback store.
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_basis_2 = (b"quux",)
        key_missing = (b"missing",)
        test.add_lines(key, (key_basis,), [b"foo\n"])
        # Missing (from test knit) objects are retrieved from the basis:
        basis.add_lines(key_basis, (key_basis_2,), [b"foo\n", b"bar\n"])
        basis.add_lines(key_basis_2, (), [b"quux\n"])
        basis.calls = []
        # ask for in non-topological order
        records = list(
            test.get_record_stream(
                [key, key_basis, key_missing, key_basis_2], "topological", False
            )
        )
        self.assertEqual(4, len(records))
        results = []
        for record in records:
            self.assertSubset([record.key], (key_basis, key_missing, key_basis_2, key))
            if record.key == key_missing:
                self.assertIsInstance(record, AbsentContentFactory)
            else:
                results.append(
                    (
                        record.key,
                        record.sha1,
                        record.storage_kind,
                        record.get_bytes_as(record.storage_kind),
                    )
                )
        calls = list(basis.calls)
        order = [record[0] for record in results]
        self.assertEqual([key_basis_2, key_basis, key], order)
        for result in results:
            source = test if result[0] == key else basis
            record = next(source.get_record_stream([result[0]], "unordered", False))
            self.assertEqual(record.key, result[0])
            self.assertEqual(record.sha1, result[1])
            self.assertEqual(record.storage_kind, result[2])
            self.assertEqual(record.get_bytes_as(record.storage_kind), result[3])
        # It's not strictly minimal, but it seems reasonable for now for it to
        # ask which fallbacks have which parents.
        self.assertEqual(
            [
                ("get_parent_map", {key_basis, key_basis_2, key_missing}),
                ("get_record_stream", [key_basis_2, key_basis], "topological", False),
            ],
            calls,
        )

    def test_get_sha1s(self):
        # sha1's in the test knit are answered without asking the basis
        basis, test = self.get_basis_and_test_knit()
        key = (b"foo",)
        key_basis = (b"bar",)
        key_missing = (b"missing",)
        test.add_lines(key, (), [b"foo\n"])
        key_sha1sum = osutils.sha_string(b"foo\n")
        sha1s = test.get_sha1s([key])
        self.assertEqual({key: key_sha1sum}, sha1s)
        self.assertEqual([], basis.calls)
        # But texts that are not in the test knit are looked for in the basis
        # directly (rather than via text reconstruction) so that remote servers
        # etc don't have to answer with full content.
        basis.add_lines(key_basis, (), [b"foo\n", b"bar\n"])
        basis_sha1sum = osutils.sha_string(b"foo\nbar\n")
        basis.calls = []
        sha1s = test.get_sha1s([key, key_missing, key_basis])
        self.assertEqual({key: key_sha1sum, key_basis: basis_sha1sum}, sha1s)
        self.assertEqual([("get_sha1s", {key_basis, key_missing})], basis.calls)

    def test_insert_record_stream(self):
        # records are inserted as normal; insert_record_stream builds on
        # add_lines, so a smoke test should be all that's needed:
        key_basis = (b"bar",)
        key_delta = (b"zaphod",)
        basis, test = self.get_basis_and_test_knit()
        source = self.make_test_knit(name="source")
        basis.add_lines(key_basis, (), [b"foo\n"])
        basis.calls = []
        source.add_lines(key_basis, (), [b"foo\n"])
        source.add_lines(key_delta, (key_basis,), [b"bar\n"])
        stream = source.get_record_stream([key_delta], "unordered", False)
        test.insert_record_stream(stream)
        # XXX: this does somewhat too many calls in making sure of whether it
        # has to recreate the full text.
self.assertEqual( [ ("get_parent_map", {key_basis}), ("get_parent_map", {key_basis}), ("get_record_stream", [key_basis], "unordered", True), ], basis.calls, ) self.assertEqual({key_delta: (key_basis,)}, test.get_parent_map([key_delta])) self.assertEqual( b"bar\n", next(test.get_record_stream([key_delta], "unordered", True)).get_bytes_as( "fulltext" ), ) def test_iter_lines_added_or_present_in_keys(self): # Lines from the basis are returned, and lines for a given key are only # returned once. key1 = (b"foo1",) key2 = (b"foo2",) # all sources are asked for keys: basis, test = self.get_basis_and_test_knit() basis.add_lines(key1, (), [b"foo"]) basis.calls = [] lines = list(test.iter_lines_added_or_present_in_keys([key1])) self.assertEqual([(b"foo\n", key1)], lines) self.assertEqual([("iter_lines_added_or_present_in_keys", {key1})], basis.calls) # keys in both are not duplicated: test.add_lines(key2, (), [b"bar\n"]) basis.add_lines(key2, (), [b"bar\n"]) basis.calls = [] lines = list(test.iter_lines_added_or_present_in_keys([key2])) self.assertEqual([(b"bar\n", key2)], lines) self.assertEqual([], basis.calls) def test_keys(self): key1 = (b"foo1",) key2 = (b"foo2",) # all sources are asked for keys: basis, test = self.get_basis_and_test_knit() keys = test.keys() self.assertEqual(set(), set(keys)) self.assertEqual([("keys",)], basis.calls) # keys from a basis are returned: basis.add_lines(key1, (), []) basis.calls = [] keys = test.keys() self.assertEqual({key1}, set(keys)) self.assertEqual([("keys",)], basis.calls) # keys in both are not duplicated: test.add_lines(key2, (), []) basis.add_lines(key2, (), []) basis.calls = [] keys = test.keys() self.assertEqual(2, len(keys)) self.assertEqual({key1, key2}, set(keys)) self.assertEqual([("keys",)], basis.calls) def test_add_mpdiffs(self): # records are inserted as normal; add_mpdiff builds on # add_lines, so a smoke test should be all that's needed: key_basis = (b"bar",) key_delta = (b"zaphod",) basis, test = 
self.get_basis_and_test_knit() source = self.make_test_knit(name="source") basis.add_lines(key_basis, (), [b"foo\n"]) basis.calls = [] source.add_lines(key_basis, (), [b"foo\n"]) source.add_lines(key_delta, (key_basis,), [b"bar\n"]) diffs = source.make_mpdiffs([key_delta]) test.add_mpdiffs( [ ( key_delta, (key_basis,), source.get_sha1s([key_delta])[key_delta], diffs[0], ) ] ) self.assertEqual( [ ("get_parent_map", {key_basis}), ("get_record_stream", [key_basis], "unordered", True), ], basis.calls, ) self.assertEqual({key_delta: (key_basis,)}, test.get_parent_map([key_delta])) self.assertEqual( b"bar\n", next(test.get_record_stream([key_delta], "unordered", True)).get_bytes_as( "fulltext" ), ) def test_make_mpdiffs(self): # Generating an mpdiff across a stacking boundary should detect parent # texts regions. key = (b"foo",) key_left = (b"bar",) key_right = (b"zaphod",) basis, test = self.get_basis_and_test_knit() basis.add_lines(key_left, (), [b"bar\n"]) basis.add_lines(key_right, (), [b"zaphod\n"]) basis.calls = [] test.add_lines(key, (key_left, key_right), [b"bar\n", b"foo\n", b"zaphod\n"]) diffs = test.make_mpdiffs([key]) self.assertEqual( [ multiparent.MultiParent( [ multiparent.ParentText(0, 0, 0, 1), multiparent.NewText([b"foo\n"]), multiparent.ParentText(1, 0, 2, 1), ] ) ], diffs, ) self.assertEqual(3, len(basis.calls)) self.assertEqual( [ ("get_parent_map", {key_left, key_right}), ("get_parent_map", {key_left, key_right}), ], basis.calls[:-1], ) last_call = basis.calls[-1] self.assertEqual("get_record_stream", last_call[0]) self.assertEqual({key_left, key_right}, set(last_call[1])) self.assertEqual("topological", last_call[2]) self.assertEqual(True, last_call[3]) class TestNetworkBehaviour(KnitTests): """Tests for getting data out of/into knits over the network.""" def test_include_delta_closure_generates_a_knit_delta_closure(self): vf = self.make_test_knit(name="test") # put in three texts, giving ft, delta, delta vf.add_lines((b"base",), (), [b"base\n", 
b"content\n"]) vf.add_lines((b"d1",), ((b"base",),), [b"d1\n"]) vf.add_lines((b"d2",), ((b"d1",),), [b"d2\n"]) # But heuristics could interfere, so check what happened: self.assertEqual( ["knit-ft-gz", "knit-delta-gz", "knit-delta-gz"], [ record.storage_kind for record in vf.get_record_stream( [(b"base",), (b"d1",), (b"d2",)], "topological", False ) ], ) # generate a stream of just the deltas with include_delta_closure=True, # serialise to the network, and check that we get a delta closure on the wire. stream = vf.get_record_stream([(b"d1",), (b"d2",)], "topological", True) netb = [record.get_bytes_as(record.storage_kind) for record in stream] # The first bytes should be a memo from _ContentMapGenerator, and the # second bytes should be empty (because it's an API proxy, not something # for wire serialisation). self.assertEqual(b"", netb[1]) bytes = netb[0] kind, _line_end = network_bytes_to_kind_and_offset(bytes) self.assertEqual("knit-delta-closure", kind) class TestContentMapGenerator(KnitTests): """Tests for ContentMapGenerator.""" def test_get_record_stream_gives_records(self): vf = self.make_test_knit(name="test") # put in three texts, giving ft, delta, delta vf.add_lines((b"base",), (), [b"base\n", b"content\n"]) vf.add_lines((b"d1",), ((b"base",),), [b"d1\n"]) vf.add_lines((b"d2",), ((b"d1",),), [b"d2\n"]) keys = [(b"d1",), (b"d2",)] generator = _VFContentMapGenerator(vf, keys, global_map=vf.get_parent_map(keys)) for record in generator.get_record_stream(): if record.key == (b"d1",): self.assertEqual(b"d1\n", record.get_bytes_as("fulltext")) else: self.assertEqual(b"d2\n", record.get_bytes_as("fulltext")) def test_get_record_stream_kinds_are_raw(self): vf = self.make_test_knit(name="test") # put in three texts, giving ft, delta, delta vf.add_lines((b"base",), (), [b"base\n", b"content\n"]) vf.add_lines((b"d1",), ((b"base",),), [b"d1\n"]) vf.add_lines((b"d2",), ((b"d1",),), [b"d2\n"]) keys = [(b"base",), (b"d1",), (b"d2",)] generator = _VFContentMapGenerator(vf, keys, 
global_map=vf.get_parent_map(keys)) kinds = { (b"base",): "knit-delta-closure", (b"d1",): "knit-delta-closure-ref", (b"d2",): "knit-delta-closure-ref", } for record in generator.get_record_stream(): self.assertEqual(kinds[record.key], record.storage_kind) class TestErrors(TestCase): def test_retry_with_new_packs(self): fake_exc_info = ("{exc type}", "{exc value}", "{exc traceback}") error = pack_repo.RetryWithNewPacks( "{context}", reload_occurred=False, exc_info=fake_exc_info ) self.assertEqual( "Pack files have changed, reload and retry. context: {context} {exc value}", str(error), ) bzrformats_3.4.0.orig/bzrformats/tests/test_lru_cache.py0000644000000000000000000003552715162074037020570 0ustar00# Copyright (C) 2006, 2008, 2009 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for the lru_cache module.""" from .. import lru_cache from . 
import TestCase def walk_lru(lru): """Test helper to walk the LRU list and assert its consistency.""" node = lru._most_recently_used if node is not None and node.prev is not None: raise AssertionError( "the _most_recently_used entry is not" " supposed to have a previous entry" f" {node}" ) while node is not None: if node.next_key is lru_cache._null_key: if node is not lru._least_recently_used: raise AssertionError( f"only the last node should have no next value: {node}" ) node_next = None else: node_next = lru._cache[node.next_key] if node_next.prev is not node: raise AssertionError( f"inconsistency found, node.next.prev != node: {node}" ) if node.prev is None: if node is not lru._most_recently_used: raise AssertionError( "only the _most_recently_used should" f" not have a previous node: {node}" ) else: if node.prev.next_key != node.key: raise AssertionError( f"inconsistency found, node.prev.next != node: {node}" ) yield node node = node_next class TestLRUCache(TestCase): """Test that LRU cache properly keeps track of entries.""" def test_cache_size(self): cache = lru_cache.LRUCache(max_cache=10) self.assertEqual(10, cache.cache_size()) cache = lru_cache.LRUCache(max_cache=256) self.assertEqual(256, cache.cache_size()) cache.resize(512) self.assertEqual(512, cache.cache_size()) def test_missing(self): cache = lru_cache.LRUCache(max_cache=10) self.assertNotIn("foo", cache) self.assertRaises(KeyError, cache.__getitem__, "foo") cache["foo"] = "bar" self.assertEqual("bar", cache["foo"]) self.assertIn("foo", cache) self.assertNotIn("bar", cache) def test_map_None(self): # Make sure that we can properly map None as a key. 
cache = lru_cache.LRUCache(max_cache=10) self.assertNotIn(None, cache) cache[None] = 1 self.assertEqual(1, cache[None]) cache[None] = 2 self.assertEqual(2, cache[None]) # Test the various code paths of __getitem__, to make sure that we can # handle when None is the key for the LRU and the MRU cache[1] = 3 cache[None] = 1 cache[None] cache[1] cache[None] self.assertEqual([None, 1], [n.key for n in walk_lru(cache)]) def test_add__null_key(self): cache = lru_cache.LRUCache(max_cache=10) self.assertRaises(ValueError, cache.__setitem__, lru_cache._null_key, 1) def test_overflow(self): """Adding extra entries will pop out old ones.""" cache = lru_cache.LRUCache(max_cache=1, after_cleanup_count=1) cache["foo"] = "bar" # With a max cache of 1, adding 'baz' should pop out 'foo' cache["baz"] = "biz" self.assertNotIn("foo", cache) self.assertIn("baz", cache) self.assertEqual("biz", cache["baz"]) def test_by_usage(self): """Accessing entries bumps them up in priority.""" cache = lru_cache.LRUCache(max_cache=2) cache["baz"] = "biz" cache["foo"] = "bar" self.assertEqual("biz", cache["baz"]) # This must kick out 'foo' because it was the last accessed cache["nub"] = "in" self.assertNotIn("foo", cache) def test_len(self): cache = lru_cache.LRUCache(max_cache=10, after_cleanup_count=10) cache[1] = 10 cache[2] = 20 cache[3] = 30 cache[4] = 40 self.assertEqual(4, len(cache)) cache[5] = 50 cache[6] = 60 cache[7] = 70 cache[8] = 80 self.assertEqual(8, len(cache)) cache[1] = 15 # replacement self.assertEqual(8, len(cache)) cache[9] = 90 cache[10] = 100 cache[11] = 110 # We hit the max self.assertEqual(10, len(cache)) self.assertEqual( [11, 10, 9, 1, 8, 7, 6, 5, 4, 3], [n.key for n in walk_lru(cache)] ) def test_cleanup_shrinks_to_after_clean_count(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=3) cache[1] = 10 cache[2] = 20 cache[3] = 25 cache[4] = 30 cache[5] = 35 self.assertEqual(5, len(cache)) # This will bump us over the max, which causes us to shrink down to # 
after_cleanup_count entries cache[6] = 40 self.assertEqual(3, len(cache)) def test_after_cleanup_larger_than_max(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=10) self.assertEqual(5, cache._after_cleanup_count) def test_after_cleanup_none(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=None) # By default _after_cleanup_count is 80% of max_cache self.assertEqual(4, cache._after_cleanup_count) def test_cleanup(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=2) # Add these in order cache[1] = 10 cache[2] = 20 cache[3] = 25 cache[4] = 30 cache[5] = 35 self.assertEqual(5, len(cache)) # Force a compaction cache.cleanup() self.assertEqual(2, len(cache)) def test_preserve_last_access_order(self): cache = lru_cache.LRUCache(max_cache=5) # Add these in order cache[1] = 10 cache[2] = 20 cache[3] = 25 cache[4] = 30 cache[5] = 35 self.assertEqual([5, 4, 3, 2, 1], [n.key for n in walk_lru(cache)]) # Now access some randomly cache[2] cache[5] cache[3] cache[2] self.assertEqual([2, 3, 5, 4, 1], [n.key for n in walk_lru(cache)]) def test_get(self): cache = lru_cache.LRUCache(max_cache=5) cache[1] = 10 cache[2] = 20 self.assertEqual(20, cache.get(2)) self.assertIs(None, cache.get(3)) obj = object() self.assertIs(obj, cache.get(3, obj)) self.assertEqual([2, 1], [n.key for n in walk_lru(cache)]) self.assertEqual(10, cache.get(1)) self.assertEqual([1, 2], [n.key for n in walk_lru(cache)]) def test_keys(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=5) cache[1] = 2 cache[2] = 3 cache[3] = 4 self.assertEqual([1, 2, 3], sorted(cache.keys())) cache[4] = 5 cache[5] = 6 cache[6] = 7 self.assertEqual([2, 3, 4, 5, 6], sorted(cache.keys())) def test_resize_smaller(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=4) cache[1] = 2 cache[2] = 3 cache[3] = 4 cache[4] = 5 cache[5] = 6 self.assertEqual([1, 2, 3, 4, 5], sorted(cache.keys())) cache[6] = 7 self.assertEqual([3, 4, 5, 6], 
sorted(cache.keys())) # Now resize to something smaller, which triggers a cleanup cache.resize(max_cache=3, after_cleanup_count=2) self.assertEqual([5, 6], sorted(cache.keys())) # Adding something will use the new size cache[7] = 8 self.assertEqual([5, 6, 7], sorted(cache.keys())) cache[8] = 9 self.assertEqual([7, 8], sorted(cache.keys())) def test_resize_larger(self): cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=4) cache[1] = 2 cache[2] = 3 cache[3] = 4 cache[4] = 5 cache[5] = 6 self.assertEqual([1, 2, 3, 4, 5], sorted(cache.keys())) cache[6] = 7 self.assertEqual([3, 4, 5, 6], sorted(cache.keys())) cache.resize(max_cache=8, after_cleanup_count=6) self.assertEqual([3, 4, 5, 6], sorted(cache.keys())) cache[7] = 8 cache[8] = 9 cache[9] = 10 cache[10] = 11 self.assertEqual([3, 4, 5, 6, 7, 8, 9, 10], sorted(cache.keys())) cache[11] = 12 # triggers cleanup back to new after_cleanup_count self.assertEqual([6, 7, 8, 9, 10, 11], sorted(cache.keys())) class TestLRUSizeCache(TestCase): def test_basic_init(self): cache = lru_cache.LRUSizeCache() self.assertEqual(2048, cache._max_cache) self.assertEqual(int(cache._max_size * 0.8), cache._after_cleanup_size) self.assertEqual(0, cache._value_size) def test_add__null_key(self): cache = lru_cache.LRUSizeCache() self.assertRaises(ValueError, cache.__setitem__, lru_cache._null_key, 1) def test_add_tracks_size(self): cache = lru_cache.LRUSizeCache() self.assertEqual(0, cache._value_size) cache["my key"] = "my value text" self.assertEqual(13, cache._value_size) def test_remove_tracks_size(self): cache = lru_cache.LRUSizeCache() self.assertEqual(0, cache._value_size) cache["my key"] = "my value text" self.assertEqual(13, cache._value_size) node = cache._cache["my key"] cache._remove_node(node) self.assertEqual(0, cache._value_size) def test_no_add_over_size(self): """Adding a large value may not be cached at all.""" cache = lru_cache.LRUSizeCache(max_size=10, after_cleanup_size=5) self.assertEqual(0, cache._value_size) 
self.assertEqual({}, cache.as_dict()) cache["test"] = "key" self.assertEqual(3, cache._value_size) self.assertEqual({"test": "key"}, cache.as_dict()) cache["test2"] = "key that is too big" self.assertEqual(3, cache._value_size) self.assertEqual({"test": "key"}, cache.as_dict()) # If we would add a key, only to cleanup and remove all cached entries, # then obviously that value should not be stored cache["test3"] = "bigkey" self.assertEqual(3, cache._value_size) self.assertEqual({"test": "key"}, cache.as_dict()) cache["test4"] = "bikey" self.assertEqual(3, cache._value_size) self.assertEqual({"test": "key"}, cache.as_dict()) def test_adding_clears_cache_based_on_size(self): """The cache is cleared in LRU order until small enough.""" cache = lru_cache.LRUSizeCache(max_size=20) cache["key1"] = "value" # 5 chars cache["key2"] = "value2" # 6 chars cache["key3"] = "value23" # 7 chars self.assertEqual(5 + 6 + 7, cache._value_size) cache["key2"] # reference key2 so it gets a newer reference time cache["key4"] = "value234" # 8 chars, over limit # We have to remove 2 keys to get back under limit self.assertEqual(6 + 8, cache._value_size) self.assertEqual({"key2": "value2", "key4": "value234"}, cache.as_dict()) def test_adding_clears_to_after_cleanup_size(self): cache = lru_cache.LRUSizeCache(max_size=20, after_cleanup_size=10) cache["key1"] = "value" # 5 chars cache["key2"] = "value2" # 6 chars cache["key3"] = "value23" # 7 chars self.assertEqual(5 + 6 + 7, cache._value_size) cache["key2"] # reference key2 so it gets a newer reference time cache["key4"] = "value234" # 8 chars, over limit # We have to remove 3 keys to get back under limit self.assertEqual(8, cache._value_size) self.assertEqual({"key4": "value234"}, cache.as_dict()) def test_custom_sizes(self): def size_of_list(lst): return sum(len(x) for x in lst) cache = lru_cache.LRUSizeCache( max_size=20, after_cleanup_size=10, compute_size=size_of_list ) cache["key1"] = ["val", "ue"] # 5 chars cache["key2"] = ["val", 
"ue2"] # 6 chars cache["key3"] = ["val", "ue23"] # 7 chars self.assertEqual(5 + 6 + 7, cache._value_size) cache["key2"] # reference key2 so it gets a newer reference time cache["key4"] = ["value", "234"] # 8 chars, over limit # We have to remove 3 keys to get back under limit self.assertEqual(8, cache._value_size) self.assertEqual({"key4": ["value", "234"]}, cache.as_dict()) def test_cleanup(self): cache = lru_cache.LRUSizeCache(max_size=20, after_cleanup_size=10) # Add these in order cache["key1"] = "value" # 5 chars cache["key2"] = "value2" # 6 chars cache["key3"] = "value23" # 7 chars self.assertEqual(5 + 6 + 7, cache._value_size) cache.cleanup() # Only the most recent fits after cleaning up self.assertEqual(7, cache._value_size) def test_keys(self): cache = lru_cache.LRUSizeCache(max_size=10) cache[1] = "a" cache[2] = "b" cache[3] = "cdef" self.assertEqual([1, 2, 3], sorted(cache.keys())) def test_resize_smaller(self): cache = lru_cache.LRUSizeCache(max_size=10, after_cleanup_size=9) cache[1] = "abc" cache[2] = "def" cache[3] = "ghi" cache[4] = "jkl" # Triggers a cleanup self.assertEqual([2, 3, 4], sorted(cache.keys())) # Resize should also cleanup again cache.resize(max_size=6, after_cleanup_size=4) self.assertEqual([4], sorted(cache.keys())) # Adding should use the new max size cache[5] = "mno" self.assertEqual([4, 5], sorted(cache.keys())) cache[6] = "pqr" self.assertEqual([6], sorted(cache.keys())) def test_resize_larger(self): cache = lru_cache.LRUSizeCache(max_size=10, after_cleanup_size=9) cache[1] = "abc" cache[2] = "def" cache[3] = "ghi" cache[4] = "jkl" # Triggers a cleanup self.assertEqual([2, 3, 4], sorted(cache.keys())) cache.resize(max_size=15, after_cleanup_size=12) self.assertEqual([2, 3, 4], sorted(cache.keys())) cache[5] = "mno" cache[6] = "pqr" self.assertEqual([2, 3, 4, 5, 6], sorted(cache.keys())) cache[7] = "stu" self.assertEqual([4, 5, 6, 7], sorted(cache.keys())) 
bzrformats_3.4.0.orig/bzrformats/tests/test_merge.py0000644000000000000000000007330015162115107017723 0ustar00# Copyright (C) 2005-2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for merge implementations.""" from bzrformats import knit, versionedfile from bzrformats.merge import _PlanMerge from . import TestCaseWithMemoryTransport class TestPlanMerge(TestCaseWithMemoryTransport): """Tests for _PlanMerge and plan_merge/plan_lca_merge functionality.""" def setUp(self): """Set up versioned file infrastructure for merge tests.""" super().setUp() mapper = versionedfile.PrefixMapper() factory = knit.make_file_factory(True, mapper) self.vf = factory(self.get_transport()) self.plan_merge_vf = versionedfile._PlanMergeVersionedFile(b"root") self.plan_merge_vf.fallback_versionedfiles.append(self.vf) def add_version(self, key, parents, text): """Add a version to the backing versioned file.""" self.vf.add_lines(key, parents, [bytes([c]) + b"\n" for c in bytearray(text)]) def add_rev(self, prefix, revision_id, parents, text): """Add a revision to the versioned file using a prefix/revision_id key.""" self.add_version((prefix, revision_id), [(prefix, p) for p in parents], text) def add_uncommitted_version(self, key, parents, text): """Add an uncommitted version directly to the plan merge versioned 
file.""" self.plan_merge_vf.add_lines( key, parents, [bytes([c]) + b"\n" for c in bytearray(text)] ) def setup_plan_merge(self): """Set up a standard 3-revision merge scenario and return a _PlanMerge.""" self.add_rev(b"root", b"A", [], b"abc") self.add_rev(b"root", b"B", [b"A"], b"acehg") self.add_rev(b"root", b"C", [b"A"], b"fabg") return _PlanMerge(b"B", b"C", self.plan_merge_vf, (b"root",)) def setup_plan_merge_uncommitted(self): """Set up a merge scenario with uncommitted versions and return a _PlanMerge.""" self.add_version((b"root", b"A"), [], b"abc") self.add_uncommitted_version((b"root", b"B:"), [(b"root", b"A")], b"acehg") self.add_uncommitted_version((b"root", b"C:"), [(b"root", b"A")], b"fabg") return _PlanMerge(b"B:", b"C:", self.plan_merge_vf, (b"root",)) def test_base_from_plan(self): """Test that base_from_plan reconstructs the common base text.""" self.setup_plan_merge() plan = self.plan_merge_vf.plan_merge(b"B", b"C") pwm = versionedfile.PlanWeaveMerge(plan) self.assertEqual([b"a\n", b"b\n", b"c\n"], pwm.base_from_plan()) def test_unique_lines(self): """Test that _unique_lines identifies lines unique to each revision.""" plan = self.setup_plan_merge() self.assertEqual( plan._unique_lines(plan._get_matching_blocks(b"B", b"C")), ([1, 2, 3], [0, 2]), ) def test_plan_merge(self): """Test that plan_merge produces the correct sequence of merge operations.""" self.setup_plan_merge() plan = self.plan_merge_vf.plan_merge(b"B", b"C") self.assertEqual( [ ("new-b", b"f\n"), ("unchanged", b"a\n"), ("killed-a", b"b\n"), ("killed-b", b"c\n"), ("new-a", b"e\n"), ("new-a", b"h\n"), ("new-a", b"g\n"), ("new-b", b"g\n"), ], list(plan), ) def test_plan_merge_cherrypick(self): self.add_rev(b"root", b"A", [], b"abc") self.add_rev(b"root", b"B", [b"A"], b"abcde") self.add_rev(b"root", b"C", [b"A"], b"abcefg") self.add_rev(b"root", b"D", [b"A", b"B", b"C"], b"abcdegh") my_plan = _PlanMerge(b"B", b"D", self.plan_merge_vf, (b"root",)) # We shortcut when one text supersedes 
the other in the per-file graph. # We don't actually need to compare the texts at this point. self.assertEqual( [ ("new-b", b"a\n"), ("new-b", b"b\n"), ("new-b", b"c\n"), ("new-b", b"d\n"), ("new-b", b"e\n"), ("new-b", b"g\n"), ("new-b", b"h\n"), ], list(my_plan.plan_merge()), ) def test_plan_merge_no_common_ancestor(self): self.add_rev(b"root", b"A", [], b"abc") self.add_rev(b"root", b"B", [], b"xyz") my_plan = _PlanMerge(b"A", b"B", self.plan_merge_vf, (b"root",)) self.assertEqual( [ ("new-a", b"a\n"), ("new-a", b"b\n"), ("new-a", b"c\n"), ("new-b", b"x\n"), ("new-b", b"y\n"), ("new-b", b"z\n"), ], list(my_plan.plan_merge()), ) def test_plan_merge_tail_ancestors(self): # The graph looks like this: # A # Common to all ancestors # / \ # B C # Ancestors of E, only common to one side # |\ /| # D E F # D, F are unique to G, H respectively # |/ \| # E is the LCA for G & H, and the unique LCA for # G H # I, J # |\ /| # | X | # |/ \| # I J # criss-cross merge of G, H # # In this situation, a simple pruning of ancestors of E will leave D & # F "dangling", which looks like they introduce lines different from # the ones in E, but in actuality C&B introduced the lines, and they # are already present in E # Introduce the base text self.add_rev(b"root", b"A", [], b"abc") # Introduces a new line B self.add_rev(b"root", b"B", [b"A"], b"aBbc") # Introduces a new line C self.add_rev(b"root", b"C", [b"A"], b"abCc") # Introduce new line D self.add_rev(b"root", b"D", [b"B"], b"DaBbc") # Merges B and C by just incorporating both self.add_rev(b"root", b"E", [b"B", b"C"], b"aBbCc") # Introduce new line F self.add_rev(b"root", b"F", [b"C"], b"abCcF") # Merge D & E by just combining the texts self.add_rev(b"root", b"G", [b"D", b"E"], b"DaBbCc") # Merge F & E by just combining the texts self.add_rev(b"root", b"H", [b"F", b"E"], b"aBbCcF") # Merge G & H by just combining texts self.add_rev(b"root", b"I", [b"G", b"H"], b"DaBbCcF") # Merge G & H but supersede an old line in B 
self.add_rev(b"root", b"J", [b"H", b"G"], b"DaJbCcF") plan = self.plan_merge_vf.plan_merge(b"I", b"J") self.assertEqual( [ ("unchanged", b"D\n"), ("unchanged", b"a\n"), ("killed-b", b"B\n"), ("new-b", b"J\n"), ("unchanged", b"b\n"), ("unchanged", b"C\n"), ("unchanged", b"c\n"), ("unchanged", b"F\n"), ], list(plan), ) def test_plan_merge_tail_triple_ancestors(self): # The graph looks like this: # A # Common to all ancestors # / \ # B C # Ancestors of E, only common to one side # |\ /| # D E F # D, F are unique to G, H respectively # |/|\| # E is the LCA for G & H, and the unique LCA for # G Q H # I, J # |\ /| # Q is just an extra node which is merged into both # | X | # I and J # |/ \| # I J # criss-cross merge of G, H # # This is the same as the test_plan_merge_tail_ancestors, except we add # a third LCA that doesn't add new lines, but will trigger our more # involved ancestry logic self.add_rev(b"root", b"A", [], b"abc") self.add_rev(b"root", b"B", [b"A"], b"aBbc") self.add_rev(b"root", b"C", [b"A"], b"abCc") self.add_rev(b"root", b"D", [b"B"], b"DaBbc") self.add_rev(b"root", b"E", [b"B", b"C"], b"aBbCc") self.add_rev(b"root", b"F", [b"C"], b"abCcF") self.add_rev(b"root", b"G", [b"D", b"E"], b"DaBbCc") self.add_rev(b"root", b"H", [b"F", b"E"], b"aBbCcF") self.add_rev(b"root", b"Q", [b"E"], b"aBbCc") self.add_rev(b"root", b"I", [b"G", b"Q", b"H"], b"DaBbCcF") # Merge G & H but supersede an old line in B self.add_rev(b"root", b"J", [b"H", b"Q", b"G"], b"DaJbCcF") plan = self.plan_merge_vf.plan_merge(b"I", b"J") self.assertEqual( [ ("unchanged", b"D\n"), ("unchanged", b"a\n"), ("killed-b", b"B\n"), ("new-b", b"J\n"), ("unchanged", b"b\n"), ("unchanged", b"C\n"), ("unchanged", b"c\n"), ("unchanged", b"F\n"), ], list(plan), ) def test_plan_merge_2_tail_triple_ancestors(self): # The graph looks like this: # A B # 2 tails going back to NULL # |\ /| # D E F # D, is unique to G, F to H # |/|\| # E is the LCA for G & H, and the unique LCA for # G Q H # I, J # |\ /| # Q is 
just an extra node which is merged into both # | X | # I and J # |/ \| # I J # criss-cross merge of G, H (and Q) # # This is meant to test after hitting a 3-way LCA, and multiple tail # ancestors (only have NULL_REVISION in common) self.add_rev(b"root", b"A", [], b"abc") self.add_rev(b"root", b"B", [], b"def") self.add_rev(b"root", b"D", [b"A"], b"Dabc") self.add_rev(b"root", b"E", [b"A", b"B"], b"abcdef") self.add_rev(b"root", b"F", [b"B"], b"defF") self.add_rev(b"root", b"G", [b"D", b"E"], b"Dabcdef") self.add_rev(b"root", b"H", [b"F", b"E"], b"abcdefF") self.add_rev(b"root", b"Q", [b"E"], b"abcdef") self.add_rev(b"root", b"I", [b"G", b"Q", b"H"], b"DabcdefF") # Merge G & H but supersede an old line in B self.add_rev(b"root", b"J", [b"H", b"Q", b"G"], b"DabcdJfF") plan = self.plan_merge_vf.plan_merge(b"I", b"J") self.assertEqual( [ ("unchanged", b"D\n"), ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("killed-b", b"e\n"), ("new-b", b"J\n"), ("unchanged", b"f\n"), ("unchanged", b"F\n"), ], list(plan), ) def test_plan_merge_uncommitted_files(self): self.setup_plan_merge_uncommitted() plan = self.plan_merge_vf.plan_merge(b"B:", b"C:") self.assertEqual( [ ("new-b", b"f\n"), ("unchanged", b"a\n"), ("killed-a", b"b\n"), ("killed-b", b"c\n"), ("new-a", b"e\n"), ("new-a", b"h\n"), ("new-a", b"g\n"), ("new-b", b"g\n"), ], list(plan), ) def test_plan_merge_insert_order(self): """Weave merges are sensitive to the order of insertion. Specifically for overlapping regions, it affects which region gets put 'first'. And when a user resolves an overlapping merge, if they use the same ordering, then the lines match the parents; if they don't, only *some* of the lines match. 
""" self.add_rev(b"root", b"A", [], b"abcdef") self.add_rev(b"root", b"B", [b"A"], b"abwxcdef") self.add_rev(b"root", b"C", [b"A"], b"abyzcdef") # Merge, and resolve the conflict by adding *both* sets of lines # If we get the ordering wrong, these will look like new lines in D, # rather than carried over from B, C self.add_rev(b"root", b"D", [b"B", b"C"], b"abwxyzcdef") # Supersede the lines in B and delete the lines in C, which will # conflict if they are treated as being in D self.add_rev(b"root", b"E", [b"C", b"B"], b"abnocdef") # Same thing for the lines in C self.add_rev(b"root", b"F", [b"C"], b"abpqcdef") plan = self.plan_merge_vf.plan_merge(b"D", b"E") self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("killed-b", b"w\n"), ("killed-b", b"x\n"), ("killed-b", b"y\n"), ("killed-b", b"z\n"), ("new-b", b"n\n"), ("new-b", b"o\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], list(plan), ) plan = self.plan_merge_vf.plan_merge(b"E", b"D") # Going in the opposite direction shows the effect of the opposite plan self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("new-b", b"w\n"), ("new-b", b"x\n"), ("killed-a", b"y\n"), ("killed-a", b"z\n"), ("killed-both", b"w\n"), ("killed-both", b"x\n"), ("new-a", b"n\n"), ("new-a", b"o\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], list(plan), ) def test_plan_merge_criss_cross(self): # This is specificly trying to trigger problems when using limited # ancestry and weaves. The ancestry graph looks like: # XX unused ancestor, should not show up in the weave # | # A Unique LCA # |\ # B \ Introduces a line 'foo' # / \ \ # C D E C & D both have 'foo', E has different changes # |\ /| | # | X | | # |/ \|/ # F G All of C, D, E are merged into F and G, so they are # all common ancestors. # # The specific issue with weaves: # B introduced a text ('foo') that is present in both C and D. 
# If we do not include B (because it isn't an ancestor of E), then # the A=>C and A=>D look like both sides independently introduce the # text ('foo'). If F does not modify the text, it would still appear # to have deleted one of the versions from C or D. If G then modifies # 'foo', it should appear as superseding the value in F (since it # came from B), rather than conflict because of the resolution during # C & D. self.add_rev(b"root", b"XX", [], b"qrs") self.add_rev(b"root", b"A", [b"XX"], b"abcdef") self.add_rev(b"root", b"B", [b"A"], b"axcdef") self.add_rev(b"root", b"C", [b"B"], b"axcdefg") self.add_rev(b"root", b"D", [b"B"], b"haxcdef") self.add_rev(b"root", b"E", [b"A"], b"abcdyf") # Simple combining of all texts self.add_rev(b"root", b"F", [b"C", b"D", b"E"], b"haxcdyfg") # combine and supersede 'x' self.add_rev(b"root", b"G", [b"C", b"D", b"E"], b"hazcdyfg") plan = self.plan_merge_vf.plan_merge(b"F", b"G") self.assertEqual( [ ("unchanged", b"h\n"), ("unchanged", b"a\n"), ("killed-base", b"b\n"), ("killed-b", b"x\n"), ("new-b", b"z\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("killed-base", b"e\n"), ("unchanged", b"y\n"), ("unchanged", b"f\n"), ("unchanged", b"g\n"), ], list(plan), ) plan = self.plan_merge_vf.plan_lca_merge(b"F", b"G") # This is one of the main differences between plan_merge and # plan_lca_merge. plan_lca_merge generates a conflict for 'x => z', # because 'x' was not present in one of the bases. However, in this # case it is spurious because 'x' does not exist in the global base A. self.assertEqual( [ ("unchanged", b"h\n"), ("unchanged", b"a\n"), ("conflicted-a", b"x\n"), ("new-b", b"z\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("unchanged", b"y\n"), ("unchanged", b"f\n"), ("unchanged", b"g\n"), ], list(plan), ) def test_criss_cross_flip_flop(self): # This is specifically trying to trigger problems when using limited # ancestry and weaves. 
The ancestry graph looks like: # XX unused ancestor, should not show up in the weave # | # A Unique LCA # / \ # B C B & C both introduce a new line # |\ /| # | X | # |/ \| # D E B & C are both merged, so both are common ancestors # In the process of merging, both sides order the new # lines differently # self.add_rev(b"root", b"XX", [], b"qrs") self.add_rev(b"root", b"A", [b"XX"], b"abcdef") self.add_rev(b"root", b"B", [b"A"], b"abcdgef") self.add_rev(b"root", b"C", [b"A"], b"abcdhef") self.add_rev(b"root", b"D", [b"B", b"C"], b"abcdghef") self.add_rev(b"root", b"E", [b"C", b"B"], b"abcdhgef") plan = list(self.plan_merge_vf.plan_merge(b"D", b"E")) self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("new-b", b"h\n"), ("unchanged", b"g\n"), ("killed-b", b"h\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], plan, ) pwm = versionedfile.PlanWeaveMerge(plan) self.assertEqualDiff( b"a\nb\nc\nd\ng\nh\ne\nf\n", b"".join(pwm.base_from_plan()) ) # Reversing the order reverses the merge plan, and final order of 'hg' # => 'gh' plan = list(self.plan_merge_vf.plan_merge(b"E", b"D")) self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("new-b", b"g\n"), ("unchanged", b"h\n"), ("killed-b", b"g\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], plan, ) pwm = versionedfile.PlanWeaveMerge(plan) self.assertEqualDiff( b"a\nb\nc\nd\nh\ng\ne\nf\n", b"".join(pwm.base_from_plan()) ) # This is where lca differs, in that it (fairly correctly) determines # that there is a conflict because both sides resolved the merge # differently plan = list(self.plan_merge_vf.plan_lca_merge(b"D", b"E")) self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("conflicted-b", b"h\n"), ("unchanged", b"g\n"), ("conflicted-a", b"h\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], plan, ) pwm = 
versionedfile.PlanWeaveMerge(plan) self.assertEqualDiff(b"a\nb\nc\nd\ng\ne\nf\n", b"".join(pwm.base_from_plan())) # Reversing it changes what line is doubled, but still gives a # double-conflict plan = list(self.plan_merge_vf.plan_lca_merge(b"E", b"D")) self.assertEqual( [ ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("unchanged", b"d\n"), ("conflicted-b", b"g\n"), ("unchanged", b"h\n"), ("conflicted-a", b"g\n"), ("unchanged", b"e\n"), ("unchanged", b"f\n"), ], plan, ) pwm = versionedfile.PlanWeaveMerge(plan) self.assertEqualDiff(b"a\nb\nc\nd\nh\ne\nf\n", b"".join(pwm.base_from_plan())) def assertRemoveExternalReferences( self, filtered_parent_map, child_map, tails, parent_map ): """Assert results for _PlanMerge._remove_external_references.""" ( act_filtered_parent_map, act_child_map, act_tails, ) = _PlanMerge._remove_external_references(parent_map) # The parent map *should* preserve ordering, but the ordering of # children is not strictly defined # child_map = dict((k, sorted(children)) # for k, children in child_map.iteritems()) # act_child_map = dict(k, sorted(children) # for k, children in act_child_map.iteritems()) self.assertEqual(filtered_parent_map, act_filtered_parent_map) self.assertEqual(child_map, act_child_map) self.assertEqual(sorted(tails), sorted(act_tails)) def test__remove_external_references(self): # First, nothing to remove self.assertRemoveExternalReferences( {3: [2], 2: [1], 1: []}, {1: [2], 2: [3], 3: []}, [1], {3: [2], 2: [1], 1: []}, ) # The reverse direction self.assertRemoveExternalReferences( {1: [2], 2: [3], 3: []}, {3: [2], 2: [1], 1: []}, [3], {1: [2], 2: [3], 3: []}, ) # Extra references self.assertRemoveExternalReferences( {3: [2], 2: [1], 1: []}, {1: [2], 2: [3], 3: []}, [1], {3: [2, 4], 2: [1, 5], 1: [6]}, ) # Multiple tails self.assertRemoveExternalReferences( {4: [2, 3], 3: [], 2: [1], 1: []}, {1: [2], 2: [4], 3: [4], 4: []}, [1, 3], {4: [2, 3], 3: [5], 2: [1], 1: [6]}, ) # Multiple children 
self.assertRemoveExternalReferences( {1: [3], 2: [3, 4], 3: [], 4: []}, {1: [], 2: [], 3: [1, 2], 4: [2]}, [3, 4], {1: [3], 2: [3, 4], 3: [5], 4: []}, ) def assertPruneTails(self, pruned_map, tails, parent_map): child_map = {} for key, parent_keys in parent_map.items(): child_map.setdefault(key, []) for pkey in parent_keys: child_map.setdefault(pkey, []).append(key) _PlanMerge._prune_tails(parent_map, child_map, tails) self.assertEqual(pruned_map, parent_map) def test__prune_tails(self): # Nothing requested to prune self.assertPruneTails({1: [], 2: [], 3: []}, [], {1: [], 2: [], 3: []}) # Prune a single entry self.assertPruneTails({1: [], 3: []}, [2], {1: [], 2: [], 3: []}) # Prune a chain self.assertPruneTails({1: []}, [3], {1: [], 2: [3], 3: []}) # Prune a chain with a diamond self.assertPruneTails({1: []}, [5], {1: [], 2: [3, 4], 3: [5], 4: [5], 5: []}) # Prune a partial chain self.assertPruneTails( {1: [6], 6: []}, [5], {1: [2, 6], 2: [3, 4], 3: [5], 4: [5], 5: [], 6: []} ) # Prune a chain with multiple tips, that pulls out intermediates self.assertPruneTails( {1: [3], 3: []}, [4, 5], {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []} ) self.assertPruneTails( {1: [3], 3: []}, [5, 4], {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []} ) def test_subtract_plans(self): old_plan = [ ("unchanged", b"a\n"), ("new-a", b"b\n"), ("killed-a", b"c\n"), ("new-b", b"d\n"), ("new-b", b"e\n"), ("killed-b", b"f\n"), ("killed-b", b"g\n"), ] new_plan = [ ("unchanged", b"a\n"), ("new-a", b"b\n"), ("killed-a", b"c\n"), ("new-b", b"d\n"), ("new-b", b"h\n"), ("killed-b", b"f\n"), ("killed-b", b"i\n"), ] subtracted_plan = [ ("unchanged", b"a\n"), ("new-a", b"b\n"), ("killed-a", b"c\n"), ("new-b", b"h\n"), ("unchanged", b"f\n"), ("killed-b", b"i\n"), ] self.assertEqual( subtracted_plan, list(_PlanMerge._subtract_plans(old_plan, new_plan)) ) def setup_merge_with_base(self): self.add_rev(b"root", b"COMMON", [], b"abc") self.add_rev(b"root", b"THIS", [b"COMMON"], b"abcd") self.add_rev(b"root", 
b"BASE", [b"COMMON"], b"eabc") self.add_rev(b"root", b"OTHER", [b"BASE"], b"eafb") def test_plan_merge_with_base(self): self.setup_merge_with_base() plan = self.plan_merge_vf.plan_merge(b"THIS", b"OTHER", b"BASE") self.assertEqual( [ ("unchanged", b"a\n"), ("new-b", b"f\n"), ("unchanged", b"b\n"), ("killed-b", b"c\n"), ("new-a", b"d\n"), ], list(plan), ) def test_plan_lca_merge(self): self.setup_plan_merge() plan = self.plan_merge_vf.plan_lca_merge(b"B", b"C") self.assertEqual( [ ("new-b", b"f\n"), ("unchanged", b"a\n"), ("killed-b", b"c\n"), ("new-a", b"e\n"), ("new-a", b"h\n"), ("killed-a", b"b\n"), ("unchanged", b"g\n"), ], list(plan), ) def test_plan_lca_merge_uncommitted_files(self): self.setup_plan_merge_uncommitted() plan = self.plan_merge_vf.plan_lca_merge(b"B:", b"C:") self.assertEqual( [ ("new-b", b"f\n"), ("unchanged", b"a\n"), ("killed-b", b"c\n"), ("new-a", b"e\n"), ("new-a", b"h\n"), ("killed-a", b"b\n"), ("unchanged", b"g\n"), ], list(plan), ) def test_plan_lca_merge_with_base(self): self.setup_merge_with_base() plan = self.plan_merge_vf.plan_lca_merge(b"THIS", b"OTHER", b"BASE") self.assertEqual( [ ("unchanged", b"a\n"), ("new-b", b"f\n"), ("unchanged", b"b\n"), ("killed-b", b"c\n"), ("new-a", b"d\n"), ], list(plan), ) def test_plan_lca_merge_with_criss_cross(self): self.add_version((b"root", b"ROOT"), [], b"abc") # each side makes a change self.add_version((b"root", b"REV1"), [(b"root", b"ROOT")], b"abcd") self.add_version((b"root", b"REV2"), [(b"root", b"ROOT")], b"abce") # both sides merge, discarding others' changes self.add_version( (b"root", b"LCA1"), [(b"root", b"REV1"), (b"root", b"REV2")], b"abcd" ) self.add_version( (b"root", b"LCA2"), [(b"root", b"REV1"), (b"root", b"REV2")], b"fabce" ) plan = self.plan_merge_vf.plan_lca_merge(b"LCA1", b"LCA2") self.assertEqual( [ ("new-b", b"f\n"), ("unchanged", b"a\n"), ("unchanged", b"b\n"), ("unchanged", b"c\n"), ("conflicted-a", b"d\n"), ("conflicted-b", b"e\n"), ], list(plan), ) def 
test_plan_lca_merge_with_null(self): self.add_version((b"root", b"A"), [], b"ab") self.add_version((b"root", b"B"), [], b"bc") plan = self.plan_merge_vf.plan_lca_merge(b"A", b"B") self.assertEqual( [ ("new-a", b"a\n"), ("unchanged", b"b\n"), ("new-b", b"c\n"), ], list(plan), ) def test_plan_merge_with_delete_and_change(self): self.add_rev(b"root", b"C", [], b"a") self.add_rev(b"root", b"A", [b"C"], b"b") self.add_rev(b"root", b"B", [b"C"], b"") plan = self.plan_merge_vf.plan_merge(b"A", b"B") self.assertEqual( [ ("killed-both", b"a\n"), ("new-a", b"b\n"), ], list(plan), ) def test_plan_merge_with_move_and_change(self): self.add_rev(b"root", b"C", [], b"abcd") self.add_rev(b"root", b"A", [b"C"], b"acbd") self.add_rev(b"root", b"B", [b"C"], b"aBcd") plan = self.plan_merge_vf.plan_merge(b"A", b"B") self.assertEqual( [ ("unchanged", b"a\n"), ("new-a", b"c\n"), ("killed-b", b"b\n"), ("new-b", b"B\n"), ("killed-a", b"c\n"), ("unchanged", b"d\n"), ], list(plan), ) bzrformats_3.4.0.orig/bzrformats/tests/test_multiparent.py0000644000000000000000000002625615162115103021174 0ustar00# Copyright (C) 2007, 2009, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA import patiencediff from bzrformats import tests from .. import multiparent from . 
import TestCase LINES_1 = b"a\nb\nc\nd\ne\n".splitlines(True) LINES_2 = b"a\nc\nd\ne\n".splitlines(True) LINES_3 = b"a\nb\nc\nd\n".splitlines(True) LF_SPLIT_LINES = [b"\x00\n", b"\x00\r\x01\n", b"\x02\r\xff"] class Mock: """Mock object for testing.""" def __init__(self, **kwargs): """Initialize the mock object with the given attributes.""" self.__dict__ = kwargs class TestMulti(TestCase): def test_compare_no_parent(self): diff = multiparent.MultiParent.from_lines(LINES_1) self.assertEqual([multiparent.NewText(LINES_1)], diff.hunks) def test_compare_one_parent(self): diff = multiparent.MultiParent.from_lines(LINES_1, [LINES_2]) self.assertEqual( [ multiparent.ParentText(0, 0, 0, 1), multiparent.NewText([b"b\n"]), multiparent.ParentText(0, 1, 2, 3), ], diff.hunks, ) diff = multiparent.MultiParent.from_lines(LINES_2, [LINES_1]) self.assertEqual( [multiparent.ParentText(0, 0, 0, 1), multiparent.ParentText(0, 2, 1, 3)], diff.hunks, ) def test_compare_two_parents(self): diff = multiparent.MultiParent.from_lines(LINES_1, [LINES_2, LINES_3]) self.assertEqual( [multiparent.ParentText(1, 0, 0, 4), multiparent.ParentText(0, 3, 4, 1)], diff.hunks, ) def test_compare_two_parents_blocks(self): matcher = patiencediff.PatienceSequenceMatcher(None, LINES_2, LINES_1) blocks = matcher.get_matching_blocks() diff = multiparent.MultiParent.from_lines( LINES_1, [LINES_2, LINES_3], left_blocks=blocks ) self.assertEqual( [multiparent.ParentText(1, 0, 0, 4), multiparent.ParentText(0, 3, 4, 1)], diff.hunks, ) def test_get_matching_blocks(self): diff = multiparent.MultiParent.from_lines(LINES_1, [LINES_2]) self.assertEqual( [(0, 0, 1), (1, 2, 3), (4, 5, 0)], list(diff.get_matching_blocks(0, len(LINES_2))), ) diff = multiparent.MultiParent.from_lines(LINES_2, [LINES_1]) self.assertEqual( [(0, 0, 1), (2, 1, 3), (5, 4, 0)], list(diff.get_matching_blocks(0, len(LINES_1))), ) def test_range_iterator(self): diff = multiparent.MultiParent.from_lines(LINES_1, [LINES_2, LINES_3]) 
diff.hunks.append(multiparent.NewText([b"q\n"])) self.assertEqual( [ (0, 4, "parent", (1, 0, 4)), (4, 5, "parent", (0, 3, 4)), (5, 6, "new", [b"q\n"]), ], list(diff.range_iterator()), ) def test_eq(self): diff = multiparent.MultiParent.from_lines(LINES_1) diff2 = multiparent.MultiParent.from_lines(LINES_1) self.assertEqual(diff, diff2) diff3 = multiparent.MultiParent.from_lines(LINES_2) self.assertNotEqual(diff, diff3) self.assertNotEqual(diff, Mock(hunks=[multiparent.NewText(LINES_1)])) self.assertEqual( multiparent.MultiParent( [multiparent.NewText(LINES_1), multiparent.ParentText(0, 1, 2, 3)] ), multiparent.MultiParent( [multiparent.NewText(LINES_1), multiparent.ParentText(0, 1, 2, 3)] ), ) def test_to_patch(self): self.assertEqual( [b"i 1\n", b"a\n", b"\n", b"c 0 1 2 3\n"], list( multiparent.MultiParent( [multiparent.NewText([b"a\n"]), multiparent.ParentText(0, 1, 2, 3)] ).to_patch() ), ) def test_from_patch(self): self.assertEqual( multiparent.MultiParent( [multiparent.NewText([b"a\n"]), multiparent.ParentText(0, 1, 2, 3)] ), multiparent.MultiParent.from_patch(b"i 1\na\n\nc 0 1 2 3"), ) self.assertEqual( multiparent.MultiParent( [multiparent.NewText([b"a"]), multiparent.ParentText(0, 1, 2, 3)] ), multiparent.MultiParent.from_patch(b"i 1\na\nc 0 1 2 3\n"), ) def test_binary_content(self): patch = list(multiparent.MultiParent.from_lines(LF_SPLIT_LINES).to_patch()) multiparent.MultiParent.from_patch(b"".join(patch)) def test_make_patch_from_binary(self): patch = multiparent.MultiParent.from_texts(b"".join(LF_SPLIT_LINES)) expected = multiparent.MultiParent([multiparent.NewText(LF_SPLIT_LINES)]) self.assertEqual(expected, patch) def test_num_lines(self): mp = multiparent.MultiParent([multiparent.NewText([b"a\n"])]) self.assertEqual(1, mp.num_lines()) mp.hunks.append(multiparent.NewText([b"b\n", b"c\n"])) self.assertEqual(3, mp.num_lines()) mp.hunks.append(multiparent.ParentText(0, 0, 3, 2)) self.assertEqual(5, mp.num_lines()) 
mp.hunks.append(multiparent.NewText([b"f\n", b"g\n"])) self.assertEqual(7, mp.num_lines()) def test_to_lines(self): mpdiff = multiparent.MultiParent.from_texts(b"a\nb\nc\n", (b"b\nc\n",)) lines = mpdiff.to_lines((b"b\ne\n",)) self.assertEqual([b"a\n", b"b\n", b"e\n"], lines) class TestNewText(TestCase): def test_eq(self): self.assertEqual(multiparent.NewText([]), multiparent.NewText([])) self.assertNotEqual(multiparent.NewText([b"a"]), multiparent.NewText([b"b"])) self.assertNotEqual(multiparent.NewText([b"a"]), Mock(lines=[b"a"])) def test_to_patch(self): self.assertEqual([b"i 0\n", b"\n"], list(multiparent.NewText([]).to_patch())) self.assertEqual( [b"i 1\n", b"a", b"\n"], list(multiparent.NewText([b"a"]).to_patch()) ) self.assertEqual( [b"i 1\n", b"a\n", b"\n"], list(multiparent.NewText([b"a\n"]).to_patch()) ) class TestParentText(TestCase): def test_eq(self): self.assertEqual( multiparent.ParentText(1, 2, 3, 4), multiparent.ParentText(1, 2, 3, 4) ) self.assertNotEqual( multiparent.ParentText(1, 2, 3, 4), multiparent.ParentText(2, 2, 3, 4) ) self.assertNotEqual( multiparent.ParentText(1, 2, 3, 4), Mock(parent=1, parent_pos=2, child_pos=3, num_lines=4), ) def test_to_patch(self): self.assertEqual( [b"c 0 1 2 3\n"], list(multiparent.ParentText(0, 1, 2, 3).to_patch()) ) REV_A = [b"a\n", b"b\n", b"c\n", b"d\n"] REV_B = [b"a\n", b"c\n", b"d\n", b"e\n"] REV_C = [b"a\n", b"b\n", b"e\n", b"f\n"] class TestVersionedFile(TestCase): def add_version(self, vf, text, version_id, parent_ids): vf.add_version( [(bytes([t]) + b"\n") for t in bytearray(text)], version_id, parent_ids ) def make_vf(self): vf = multiparent.MultiMemoryVersionedFile() self.add_version(vf, b"abcd", b"rev-a", []) self.add_version(vf, b"acde", b"rev-b", []) self.add_version(vf, b"abef", b"rev-c", [b"rev-a", b"rev-b"]) return vf def test_add_version(self): vf = self.make_vf() self.assertEqual(REV_A, vf._lines[b"rev-a"]) vf.clear_cache() self.assertEqual(vf._lines, {}) def test_get_line_list(self): vf = 
self.make_vf() vf.clear_cache() self.assertEqual(REV_A, vf.get_line_list([b"rev-a"])[0]) self.assertEqual([REV_B, REV_C], vf.get_line_list([b"rev-b", b"rev-c"])) def test_reconstruct_empty(self): vf = multiparent.MultiMemoryVersionedFile() vf.add_version([], b"a", []) self.assertEqual([], self.reconstruct_version(vf, b"a")) @staticmethod def reconstruct(vf, revision_id, start, end): reconstructor = multiparent._Reconstructor(vf, vf._lines, vf._parents) lines = [] reconstructor._reconstruct(lines, revision_id, start, end) return lines @staticmethod def reconstruct_version(vf, revision_id): reconstructor = multiparent._Reconstructor(vf, vf._lines, vf._parents) lines = [] reconstructor.reconstruct_version(lines, revision_id) return lines def test_reconstructor(self): vf = self.make_vf() self.assertEqual([b"a\n", b"b\n"], self.reconstruct(vf, b"rev-a", 0, 2)) self.assertEqual([b"c\n", b"d\n"], self.reconstruct(vf, b"rev-a", 2, 4)) self.assertEqual([b"e\n", b"f\n"], self.reconstruct(vf, b"rev-c", 2, 4)) self.assertEqual( [b"a\n", b"b\n", b"e\n", b"f\n"], self.reconstruct(vf, b"rev-c", 0, 4) ) self.assertEqual( [b"a\n", b"b\n", b"e\n", b"f\n"], self.reconstruct_version(vf, b"rev-c") ) def test_reordered(self): """Check for a corner case that requires re-starting the cursor.""" vf = multiparent.MultiMemoryVersionedFile() # rev-b must have at least two hunks, so split a and b with c. 
self.add_version(vf, b"c", b"rev-a", []) self.add_version(vf, b"acb", b"rev-b", [b"rev-a"]) # rev-c and rev-d must each have a line from a different rev-b hunk self.add_version(vf, b"b", b"rev-c", [b"rev-b"]) self.add_version(vf, b"a", b"rev-d", [b"rev-b"]) # The lines from rev-c and rev-d must appear in the opposite order self.add_version(vf, b"ba", b"rev-e", [b"rev-c", b"rev-d"]) vf.clear_cache() lines = vf.get_line_list([b"rev-e"])[0] self.assertEqual([b"b\n", b"a\n"], lines) class TestMultiVersionedFile(tests.TestCaseInTempDir): def test_save_load(self): vf = multiparent.MultiVersionedFile("foop") vf.add_version(b"a\nb\nc\nd".splitlines(True), b"a", []) vf.add_version(b"a\ne\nd\n".splitlines(True), b"b", [b"a"]) vf.save() newvf = multiparent.MultiVersionedFile("foop") newvf.load() self.assertEqual(b"a\nb\nc\nd", b"".join(newvf.get_line_list([b"a"])[0])) self.assertEqual(b"a\ne\nd\n", b"".join(newvf.get_line_list([b"b"])[0])) def test_filenames(self): vf = multiparent.MultiVersionedFile("foop") vf.add_version(b"a\nb\nc\nd".splitlines(True), b"a", []) self.assertPathExists("foop.mpknit") self.assertPathDoesNotExist("foop.mpidx") vf.save() self.assertPathExists("foop.mpidx") vf.destroy() self.assertPathDoesNotExist("foop.mpknit") self.assertPathDoesNotExist("foop.mpidx") bzrformats_3.4.0.orig/bzrformats/tests/test_osutils.py0000644000000000000000000002403415162115107020326 0ustar00# Copyright (C) 2005-2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for bzrformats osutils.""" import hashlib import os from .. import osutils from . import TestCase, TestCaseInTempDir class TestShaFunctions(TestCase): """Test the sha_string and sha_strings functions.""" def test_sha_string_bytes(self): """Test sha_string with bytes input.""" result = osutils.sha_string(b"hello world") expected = hashlib.sha1(b"hello world").hexdigest().encode("ascii") # noqa: S324 self.assertEqual(expected, result) def test_sha_string_unicode(self): """Test sha_string with unicode input.""" result = osutils.sha_string("hello world") expected = hashlib.sha1(b"hello world").hexdigest().encode("ascii") # noqa: S324 self.assertEqual(expected, result) def test_sha_strings(self): """Test sha_strings with mixed input.""" result = osutils.sha_strings([b"hello", " ", "world"]) sha = hashlib.sha1() # noqa: S324 sha.update(b"hello") sha.update(b" ") sha.update(b"world") expected = sha.hexdigest().encode("ascii") self.assertEqual(expected, result) class TestOsutilsFunctions(TestCase): """Test various osutils functions.""" def test_split_unicode(self): """Test split with unicode paths.""" dirname, basename = osutils.split("foo/bar") self.assertEqual("foo", dirname) self.assertEqual("bar", basename) def test_split_bytes(self): """Test split with byte paths.""" dirname, basename = osutils.split(b"foo/bar") self.assertEqual(b"foo", dirname) self.assertEqual(b"bar", basename) def test_pathjoin_unicode(self): """Test pathjoin with unicode paths.""" result = osutils.pathjoin("foo", "bar", "baz") self.assertEqual(os.path.join("foo", "bar", "baz"), result) def test_pathjoin_bytes(self): """Test pathjoin with byte paths.""" result = osutils.pathjoin(b"foo", b"bar", b"baz") self.assertEqual(os.path.join(b"foo", b"bar", b"baz"), result) def 
test_basename_unicode(self): """Test basename with unicode path.""" result = osutils.basename("foo/bar/baz") self.assertEqual("baz", result) def test_basename_bytes(self): """Test basename with byte path.""" result = osutils.basename(b"foo/bar/baz") self.assertEqual(b"baz", result) def test_dirname_unicode(self): """Test dirname with unicode path.""" result = osutils.dirname("foo/bar/baz") self.assertEqual("foo/bar", result) def test_dirname_bytes(self): """Test dirname with byte path.""" result = osutils.dirname(b"foo/bar/baz") self.assertEqual(b"foo/bar", result) def test_splitpath(self): """Test splitpath function.""" self.assertEqual(["foo", "bar"], osutils.splitpath("foo/bar")) self.assertEqual(["foo", "bar"], osutils.splitpath("/foo/bar")) self.assertEqual([b"foo", b"bar"], osutils.splitpath(b"foo/bar")) self.assertEqual([b"foo", b"bar"], osutils.splitpath(b"/foo/bar")) self.assertEqual([], osutils.splitpath("")) self.assertEqual([], osutils.splitpath("/")) def test_contains_whitespace(self): """Test contains_whitespace function.""" self.assertTrue(osutils.contains_whitespace("hello world")) self.assertTrue(osutils.contains_whitespace("hello\tworld")) self.assertTrue(osutils.contains_whitespace("hello\nworld")) self.assertFalse(osutils.contains_whitespace("helloworld")) # Test bytes self.assertTrue(osutils.contains_whitespace(b"hello world")) self.assertFalse(osutils.contains_whitespace(b"helloworld")) def test_normalized_filename(self): """Test normalized_filename function.""" # Simple ASCII filename result, can_access = osutils.normalized_filename("test.txt") self.assertEqual("test.txt", result) self.assertTrue(can_access) # Bytes filename result, can_access = osutils.normalized_filename(b"test.txt") self.assertEqual(b"test.txt", result) self.assertTrue(can_access) def test_chunks_to_lines(self): """Test chunks_to_lines function.""" chunks = [b"line1\n", b"line2\nli", b"ne3\n"] result = osutils.chunks_to_lines(chunks) self.assertEqual([b"line1\n", 
b"line2\n", b"line3\n"], result) # Test with no newline at end chunks = [b"line1\n", b"line2"] result = osutils.chunks_to_lines(chunks) self.assertEqual([b"line1\n", b"line2"], result) # Test empty chunks self.assertEqual([], osutils.chunks_to_lines([])) def test_chunks_to_lines_iter(self): """Test chunks_to_lines_iter function.""" chunks = iter([b"line1\n", b"line2\nli", b"ne3\n"]) result = list(osutils.chunks_to_lines_iter(chunks)) self.assertEqual([b"line1\n", b"line2\n", b"line3\n"], result) class TestRustOsutilsFunctions(TestCase): """Test the Rust-based osutils functions.""" def test_rand_chars(self): """Test rand_chars generates the right length string.""" result = osutils.rand_chars(10) self.assertEqual(10, len(result)) # Should only contain alphanumeric characters self.assertTrue(all(c.isalnum() for c in result)) def test_is_inside(self): """Test is_inside function.""" # Should work with both strings and bytes self.assertTrue(osutils.is_inside("/home", "/home/user")) self.assertTrue(osutils.is_inside("/home/", "/home/user")) self.assertFalse(osutils.is_inside("/home", "/usr/bin")) self.assertFalse(osutils.is_inside("/home/user", "/home")) def test_is_inside_any(self): """Test is_inside_any function.""" dirs = ["/home", "/usr"] self.assertTrue(osutils.is_inside_any(dirs, "/home/user")) self.assertTrue(osutils.is_inside_any(dirs, "/usr/bin")) self.assertFalse(osutils.is_inside_any(dirs, "/var/log")) def test_parent_directories(self): """Test parent_directories function.""" result = osutils.parent_directories("/home/user/documents/file.txt") # Convert to list since it returns an iterator parents = list(result) self.assertIn("/home/user/documents", parents) self.assertIn("/home/user", parents) self.assertIn("/home", parents) class TestFileIterator(TestCase): """Test file_iterator function.""" def test_file_iterator(self): """Test iterating over file contents.""" import io content = b"a" * 100000 # 100KB of data file_obj = io.BytesIO(content) chunks = 
list(osutils.file_iterator(file_obj, chunk_size=1024)) # Should have multiple chunks self.assertTrue(len(chunks) > 1) # Reassemble and check reassembled = b"".join(chunks) self.assertEqual(content, reassembled) # Check chunk sizes (all but last should be 1024) for chunk in chunks[:-1]: self.assertEqual(1024, len(chunk)) class TestPumpfile(TestCaseInTempDir): """Test pumpfile function.""" def test_pumpfile(self): """Test copying data between file objects.""" import io # Create source with some data source_data = b"Hello, world!" * 1000 source = io.BytesIO(source_data) # Create destination dest = io.BytesIO() # Pump the data bytes_copied = osutils.pumpfile(source, dest) # Check the result self.assertEqual(len(source_data), bytes_copied) self.assertEqual(source_data, dest.getvalue()) class TestFileKindFromStatMode(TestCase): """Test file_kind_from_stat_mode function.""" def test_regular_file(self): """Test regular file detection.""" import stat mode = stat.S_IFREG | 0o644 self.assertEqual("file", osutils.file_kind_from_stat_mode(mode)) def test_directory(self): """Test directory detection.""" import stat mode = stat.S_IFDIR | 0o755 self.assertEqual("directory", osutils.file_kind_from_stat_mode(mode)) def test_symlink(self): """Test symlink detection.""" import stat mode = stat.S_IFLNK | 0o777 self.assertEqual("symlink", osutils.file_kind_from_stat_mode(mode)) def test_fifo(self): """Test FIFO detection.""" import stat mode = stat.S_IFIFO | 0o666 self.assertEqual("fifo", osutils.file_kind_from_stat_mode(mode)) def test_socket(self): """Test socket detection.""" import stat mode = stat.S_IFSOCK | 0o666 self.assertEqual("socket", osutils.file_kind_from_stat_mode(mode)) def test_char_device(self): """Test character device detection.""" import stat mode = stat.S_IFCHR | 0o666 self.assertEqual("chardev", osutils.file_kind_from_stat_mode(mode)) def test_block_device(self): """Test block device detection.""" import stat mode = stat.S_IFBLK | 0o666 self.assertEqual("block", 
osutils.file_kind_from_stat_mode(mode)) # Add test module discovery def test_suite(): """Return the test suite for osutils tests.""" import sys import unittest return unittest.TestLoader().loadTestsFromModule(sys.modules[__name__]) if __name__ == "__main__": import unittest unittest.main() bzrformats_3.4.0.orig/bzrformats/tests/test_pack.py0000644000000000000000000007345415162115103017550 0ustar00# Copyright (C) 2007, 2009, 2011, 2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for bzrformats.pack.""" from io import BytesIO from .. import pack from . 
import TestCase, TestCaseWithMemoryTransport class TestContainerSerialiser(TestCase): """Tests for the ContainerSerialiser class.""" def test_construct(self): """Test constructing a ContainerSerialiser.""" pack.ContainerSerialiser() def test_begin(self): serialiser = pack.ContainerSerialiser() self.assertEqual( b"Bazaar pack format 1 (introduced in 0.18)\n", serialiser.begin() ) def test_end(self): serialiser = pack.ContainerSerialiser() self.assertEqual(b"E", serialiser.end()) def test_bytes_record_no_name(self): serialiser = pack.ContainerSerialiser() record = serialiser.bytes_record(b"bytes", []) self.assertEqual(b"B5\n\nbytes", record) def test_bytes_record_one_name_with_one_part(self): serialiser = pack.ContainerSerialiser() record = serialiser.bytes_record(b"bytes", [(b"name",)]) self.assertEqual(b"B5\nname\n\nbytes", record) def test_bytes_record_one_name_with_two_parts(self): serialiser = pack.ContainerSerialiser() record = serialiser.bytes_record(b"bytes", [(b"part1", b"part2")]) self.assertEqual(b"B5\npart1\x00part2\n\nbytes", record) def test_bytes_record_two_names(self): serialiser = pack.ContainerSerialiser() record = serialiser.bytes_record(b"bytes", [(b"name1",), (b"name2",)]) self.assertEqual(b"B5\nname1\nname2\n\nbytes", record) def test_bytes_record_whitespace_in_name_part(self): serialiser = pack.ContainerSerialiser() self.assertRaises( pack.InvalidRecordError, serialiser.bytes_record, b"bytes", [(b"bad name",)] ) def test_bytes_record_header(self): serialiser = pack.ContainerSerialiser() record = serialiser.bytes_header(32, [(b"name1",), (b"name2",)]) self.assertEqual(b"B32\nname1\nname2\n\n", record) class TestContainerWriter(TestCase): def setUp(self): super().setUp() self.output = BytesIO() self.writer = pack.ContainerWriter(self.output.write) def assertOutput(self, expected_output): """Assert that the output of self.writer ContainerWriter is equal to expected_output. 
""" self.assertEqual(expected_output, self.output.getvalue()) def test_construct(self): """Test constructing a ContainerWriter. This uses None as the output stream to show that the constructor doesn't try to use the output stream. """ pack.ContainerWriter(None) def test_begin(self): """The begin() method writes the container format marker line.""" self.writer.begin() self.assertOutput(b"Bazaar pack format 1 (introduced in 0.18)\n") def test_zero_records_written_after_begin(self): """After begin is written, 0 records have been written.""" self.writer.begin() self.assertEqual(0, self.writer.records_written) def test_end(self): """The end() method writes an End Marker record.""" self.writer.begin() self.writer.end() self.assertOutput(b"Bazaar pack format 1 (introduced in 0.18)\nE") def test_empty_end_does_not_add_a_record_to_records_written(self): """The end() method does not count towards the records written.""" self.writer.begin() self.writer.end() self.assertEqual(0, self.writer.records_written) def test_non_empty_end_does_not_add_a_record_to_records_written(self): """The end() method does not count towards the records written.""" self.writer.begin() self.writer.add_bytes_record([b"foo"], len(b"foo"), names=[]) self.writer.end() self.assertEqual(1, self.writer.records_written) def test_add_bytes_record_no_name(self): """Add a bytes record with no name.""" self.writer.begin() offset, length = self.writer.add_bytes_record([b"abc"], len(b"abc"), names=[]) self.assertEqual((42, 7), (offset, length)) self.assertOutput(b"Bazaar pack format 1 (introduced in 0.18)\nB3\n\nabc") def test_add_bytes_record_one_name(self): """Add a bytes record with one name.""" self.writer.begin() offset, length = self.writer.add_bytes_record( [b"abc"], len(b"abc"), names=[(b"name1",)] ) self.assertEqual((42, 13), (offset, length)) self.assertOutput( b"Bazaar pack format 1 (introduced in 0.18)\nB3\nname1\n\nabc" ) def test_add_bytes_record_split_writes(self): """Write a large record which does 
multiple IOs.""" writes = [] real_write = self.writer.write_func def record_writes(data): writes.append(data) return real_write(data) self.writer.write_func = record_writes self.writer._JOIN_WRITES_THRESHOLD = 2 self.writer.begin() offset, length = self.writer.add_bytes_record( [b"abcabc"], len(b"abcabc"), names=[(b"name1",)] ) self.assertEqual((42, 16), (offset, length)) self.assertOutput( b"Bazaar pack format 1 (introduced in 0.18)\nB6\nname1\n\nabcabc" ) self.assertEqual( [ b"Bazaar pack format 1 (introduced in 0.18)\n", b"B6\nname1\n\n", b"abcabc", ], writes, ) def test_add_bytes_record_two_names(self): """Add a bytes record with two names.""" self.writer.begin() offset, length = self.writer.add_bytes_record( [b"abc"], len(b"abc"), names=[(b"name1",), (b"name2",)] ) self.assertEqual((42, 19), (offset, length)) self.assertOutput( b"Bazaar pack format 1 (introduced in 0.18)\nB3\nname1\nname2\n\nabc" ) def test_add_bytes_record_two_element_name(self): """Add a bytes record with a two-element name.""" self.writer.begin() offset, length = self.writer.add_bytes_record( [b"abc"], len(b"abc"), names=[(b"name1", b"name2")] ) self.assertEqual((42, 19), (offset, length)) self.assertOutput( b"Bazaar pack format 1 (introduced in 0.18)\nB3\nname1\x00name2\n\nabc" ) def test_add_second_bytes_record_gets_higher_offset(self): self.writer.begin() self.writer.add_bytes_record([b"a", b"bc"], len(b"abc"), names=[]) offset, length = self.writer.add_bytes_record([b"abc"], len(b"abc"), names=[]) self.assertEqual((49, 7), (offset, length)) self.assertOutput( b"Bazaar pack format 1 (introduced in 0.18)\nB3\n\nabcB3\n\nabc" ) def test_add_bytes_record_invalid_name(self): """Adding a Bytes record with a name with whitespace in it raises InvalidRecordError. 
""" self.writer.begin() self.assertRaises( pack.InvalidRecordError, self.writer.add_bytes_record, [b"abc"], len(b"abc"), names=[(b"bad name",)], ) def test_add_bytes_records_add_to_records_written(self): """Adding a Bytes record increments the records_written counter.""" self.writer.begin() self.writer.add_bytes_record([b"foo"], len(b"foo"), names=[]) self.assertEqual(1, self.writer.records_written) self.writer.add_bytes_record([b"foo"], len(b"foo"), names=[]) self.assertEqual(2, self.writer.records_written) class TestContainerReader(TestCase): """Tests for the ContainerReader. The ContainerReader reads format 1 containers, so these tests explicitly test how it reacts to format 1 data. If a new version of the format is added, then separate tests for that format should be added. """ def get_reader_for(self, data): stream = BytesIO(data) reader = pack.ContainerReader(stream) return reader def test_construct(self): """Test constructing a ContainerReader. This uses None as the output stream to show that the constructor doesn't try to use the input stream. """ pack.ContainerReader(None) def test_empty_container(self): """Read an empty container.""" reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\nE") self.assertEqual([], list(reader.iter_records())) def test_unknown_format(self): """Unrecognised container formats raise UnknownContainerFormatError.""" reader = self.get_reader_for(b"unknown format\n") self.assertRaises(pack.UnknownContainerFormatError, reader.iter_records) def test_unexpected_end_of_container(self): """Containers that don't end with an End Marker record should cause UnexpectedEndOfContainerError to be raised. 
""" reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\n") iterator = reader.iter_records() self.assertRaises(pack.UnexpectedEndOfContainerError, next, iterator) def test_unknown_record_type(self): """Unknown record types cause UnknownRecordTypeError to be raised.""" reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\nX") iterator = reader.iter_records() self.assertRaises(pack.UnknownRecordTypeError, next, iterator) def test_container_with_one_unnamed_record(self): """Read a container with one Bytes record. Parsing Bytes records is more thoroughly exercised by TestBytesRecordReader. This test is here to ensure that ContainerReader's integration with BytesRecordReader is working. """ reader = self.get_reader_for( b"Bazaar pack format 1 (introduced in 0.18)\nB5\n\naaaaaE" ) expected_records = [([], b"aaaaa")] self.assertEqual( expected_records, [ (names, read_bytes(None)) for (names, read_bytes) in reader.iter_records() ], ) def test_validate_empty_container(self): """Validate does not raise an error for a container with no records.""" reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\nE") # No exception raised reader.validate() def test_validate_non_empty_valid_container(self): """Validate does not raise an error for a container with a valid record.""" reader = self.get_reader_for( b"Bazaar pack format 1 (introduced in 0.18)\nB3\nname\n\nabcE" ) # No exception raised reader.validate() def test_validate_bad_format(self): """Validate raises an error for unrecognised format strings. It may raise either UnexpectedEndOfContainerError or UnknownContainerFormatError, depending on exactly what the string is. 
""" inputs = [b"", b"x", b"Bazaar pack format 1 (introduced in 0.18)", b"bad\n"] for input in inputs: reader = self.get_reader_for(input) self.assertRaises( (pack.UnexpectedEndOfContainerError, pack.UnknownContainerFormatError), reader.validate, ) def test_validate_bad_record_marker(self): """Validate raises UnknownRecordTypeError for unrecognised record types. """ reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\nX") self.assertRaises(pack.UnknownRecordTypeError, reader.validate) def test_validate_data_after_end_marker(self): """Validate raises ContainerHasExcessDataError if there are any bytes after the end of the container. """ reader = self.get_reader_for( b"Bazaar pack format 1 (introduced in 0.18)\nEcrud" ) self.assertRaises(pack.ContainerHasExcessDataError, reader.validate) def test_validate_no_end_marker(self): """Validate raises UnexpectedEndOfContainerError if there's no end of container marker, even if the container up to this point has been valid. """ reader = self.get_reader_for(b"Bazaar pack format 1 (introduced in 0.18)\n") self.assertRaises(pack.UnexpectedEndOfContainerError, reader.validate) def test_validate_duplicate_name(self): """Validate raises DuplicateRecordNameError if the same name occurs multiple times in the container. """ reader = self.get_reader_for( b"Bazaar pack format 1 (introduced in 0.18)\nB0\nname\n\nB0\nname\n\nE" ) self.assertRaises(pack.DuplicateRecordNameError, reader.validate) def test_validate_undecodeable_name(self): """Names that aren't valid UTF-8 cause validate to fail.""" reader = self.get_reader_for( b"Bazaar pack format 1 (introduced in 0.18)\nB0\n\xcc\n\nE" ) self.assertRaises(pack.InvalidRecordError, reader.validate) class TestBytesRecordReader(TestCase): """Tests for reading and validating Bytes records with BytesRecordReader. Like TestContainerReader, this explicitly tests the reading of format 1 data. 
    If a new version of the format is added, then a separate set of tests for
    reading that format should be added.
    """

    def get_reader_for(self, data):
        stream = BytesIO(data)
        reader = pack.BytesRecordReader(stream)
        return reader

    def test_record_with_no_name(self):
        """Reading a Bytes record with no name returns an empty list of
        names.
        """
        reader = self.get_reader_for(b"5\n\naaaaa")
        names, get_bytes = reader.read()
        self.assertEqual([], names)
        self.assertEqual(b"aaaaa", get_bytes(None))

    def test_record_with_one_name(self):
        """Reading a Bytes record with one name returns a list of just that
        name.
        """
        reader = self.get_reader_for(b"5\nname1\n\naaaaa")
        names, get_bytes = reader.read()
        self.assertEqual([(b"name1",)], names)
        self.assertEqual(b"aaaaa", get_bytes(None))

    def test_record_with_two_names(self):
        """Reading a Bytes record with two names returns a list of both names."""
        reader = self.get_reader_for(b"5\nname1\nname2\n\naaaaa")
        names, get_bytes = reader.read()
        self.assertEqual([(b"name1",), (b"name2",)], names)
        self.assertEqual(b"aaaaa", get_bytes(None))

    def test_record_with_two_part_names(self):
        """Reading a Bytes record with a two_part name reads both."""
        reader = self.get_reader_for(b"5\nname1\x00name2\n\naaaaa")
        names, get_bytes = reader.read()
        self.assertEqual(
            [
                (
                    b"name1",
                    b"name2",
                )
            ],
            names,
        )
        self.assertEqual(b"aaaaa", get_bytes(None))

    def test_invalid_length(self):
        """If the length-prefix is not a number, parsing raises
        InvalidRecordError.
        """
        reader = self.get_reader_for(b"not a number\n")
        self.assertRaises(pack.InvalidRecordError, reader.read)

    def test_early_eof(self):
        """Tests for premature EOF occurring during parsing Bytes records with
        BytesRecordReader.

        An incomplete container might be interrupted at any point.  The
        BytesRecordReader needs to cope with the input stream running out no
        matter where it is in the parsing process.

        In all cases, UnexpectedEndOfContainerError should be raised.
""" complete_record = b"6\nname\n\nabcdef" for count in range(0, len(complete_record)): incomplete_record = complete_record[:count] reader = self.get_reader_for(incomplete_record) # We don't use assertRaises to make diagnosing failures easier # (assertRaises doesn't allow a custom failure message). try: _names, read_bytes = reader.read() read_bytes(None) except pack.UnexpectedEndOfContainerError: pass else: self.fail( f"UnexpectedEndOfContainerError not raised when parsing {incomplete_record!r}" ) def test_initial_eof(self): """EOF before any bytes read at all.""" reader = self.get_reader_for(b"") self.assertRaises(pack.UnexpectedEndOfContainerError, reader.read) def test_eof_after_length(self): """EOF after reading the length and before reading name(s).""" reader = self.get_reader_for(b"123\n") self.assertRaises(pack.UnexpectedEndOfContainerError, reader.read) def test_eof_during_name(self): """EOF during reading a name.""" reader = self.get_reader_for(b"123\nname") self.assertRaises(pack.UnexpectedEndOfContainerError, reader.read) def test_read_invalid_name_whitespace(self): """Names must have no whitespace.""" # A name with a space. reader = self.get_reader_for(b"0\nbad name\n\n") self.assertRaises(pack.InvalidRecordError, reader.read) # A name with a tab. reader = self.get_reader_for(b"0\nbad\tname\n\n") self.assertRaises(pack.InvalidRecordError, reader.read) # A name with a vertical tab. 
        reader = self.get_reader_for(b"0\nbad\vname\n\n")
        self.assertRaises(pack.InvalidRecordError, reader.read)

    def test_validate_whitespace_in_name(self):
        """Names must have no whitespace."""
        reader = self.get_reader_for(b"0\nbad name\n\n")
        self.assertRaises(pack.InvalidRecordError, reader.validate)

    def test_validate_interrupted_prelude(self):
        """EOF during reading a record's prelude causes validate to fail."""
        reader = self.get_reader_for(b"")
        self.assertRaises(pack.UnexpectedEndOfContainerError, reader.validate)

    def test_validate_interrupted_body(self):
        """EOF during reading a record's body causes validate to fail."""
        reader = self.get_reader_for(b"1\n\n")
        self.assertRaises(pack.UnexpectedEndOfContainerError, reader.validate)

    def test_validate_unparseable_length(self):
        """An unparseable record length causes validate to fail."""
        reader = self.get_reader_for(b"\n\n")
        self.assertRaises(pack.InvalidRecordError, reader.validate)

    def test_validate_undecodeable_name(self):
        """Names that aren't valid UTF-8 cause validate to fail."""
        reader = self.get_reader_for(b"0\n\xcc\n\n")
        self.assertRaises(pack.InvalidRecordError, reader.validate)

    def test_read_max_length(self):
        """If the max_length passed to the callable returned by read is not
        None, then no more than that many bytes will be read.
        """
        reader = self.get_reader_for(b"6\n\nabcdef")
        _names, get_bytes = reader.read()
        self.assertEqual(b"abc", get_bytes(3))

    def test_read_no_max_length(self):
        """If the max_length passed to the callable returned by read is None,
        then all the bytes in the record will be read.
        """
        reader = self.get_reader_for(b"6\n\nabcdef")
        _names, get_bytes = reader.read()
        self.assertEqual(b"abcdef", get_bytes(None))

    def test_repeated_read_calls(self):
        """Repeated calls to the callable returned from BytesRecordReader.read
        will not read beyond the end of the record.
""" reader = self.get_reader_for(b"6\n\nabcdefB3\nnext-record\nXXX") _names, get_bytes = reader.read() self.assertEqual(b"abcdef", get_bytes(None)) self.assertEqual(b"", get_bytes(None)) self.assertEqual(b"", get_bytes(99)) class TestMakeReadvReader(TestCaseWithMemoryTransport): def test_read_skipping_records(self): pack_data = BytesIO() writer = pack.ContainerWriter(pack_data.write) writer.begin() memos = [] memos.append(writer.add_bytes_record([b"abc"], 3, names=[])) memos.append(writer.add_bytes_record([b"def"], 3, names=[(b"name1",)])) memos.append(writer.add_bytes_record([b"ghi"], 3, names=[(b"name2",)])) memos.append(writer.add_bytes_record([b"jkl"], 3, names=[])) writer.end() transport = self.get_transport() transport.put_bytes("mypack", pack_data.getvalue()) requested_records = [memos[0], memos[2]] reader = pack.make_readv_reader(transport, "mypack", requested_records) result = [] for names, reader_func in reader.iter_records(): result.append((names, reader_func(None))) self.assertEqual([([], b"abc"), ([(b"name2",)], b"ghi")], result) class TestReadvFile(TestCaseWithMemoryTransport): """Tests of the ReadVFile class. Error cases are deliberately undefined: this code adapts the underlying transport interface to a single 'streaming read' interface as ContainerReader needs. """ def test_read_bytes(self): """Test reading of both single bytes and all bytes in a hunk.""" transport = self.get_transport() transport.put_bytes("sample", b"0123456789") f = pack.ReadVFile(transport.readv("sample", [(0, 1), (1, 2), (4, 1), (6, 2)])) results = [] results.append(f.read(1)) results.append(f.read(2)) results.append(f.read(1)) results.append(f.read(1)) results.append(f.read(1)) self.assertEqual([b"0", b"12", b"4", b"6", b"7"], results) def test_readline(self): """Test using readline() as ContainerReader does. This is always within a readv hunk, never across it. 
""" transport = self.get_transport() transport.put_bytes("sample", b"0\n2\n4\n") f = pack.ReadVFile(transport.readv("sample", [(0, 2), (2, 4)])) results = [] results.append(f.readline()) results.append(f.readline()) results.append(f.readline()) self.assertEqual([b"0\n", b"2\n", b"4\n"], results) def test_readline_and_read(self): """Test exercising one byte reads, readline, and then read again.""" transport = self.get_transport() transport.put_bytes("sample", b"0\n2\n4\n") f = pack.ReadVFile(transport.readv("sample", [(0, 6)])) results = [] results.append(f.read(1)) results.append(f.readline()) results.append(f.read(4)) self.assertEqual([b"0", b"\n", b"2\n4\n"], results) class PushParserTestCase(TestCase): """Base class for TestCases involving ContainerPushParser.""" def make_parser_expecting_record_type(self): parser = pack.ContainerPushParser() parser.accept_bytes(b"Bazaar pack format 1 (introduced in 0.18)\n") return parser def make_parser_expecting_bytes_record(self): parser = pack.ContainerPushParser() parser.accept_bytes(b"Bazaar pack format 1 (introduced in 0.18)\nB") return parser def assertRecordParsing(self, expected_record, data): """Assert that 'bytes' is parsed as a given bytes record. :param expected_record: A tuple of (names, bytes). """ parser = self.make_parser_expecting_bytes_record() parser.accept_bytes(data) parsed_records = parser.read_pending_records() self.assertEqual([expected_record], parsed_records) class TestContainerPushParser(PushParserTestCase): """Tests for ContainerPushParser. The ContainerPushParser reads format 1 containers, so these tests explicitly test how it reacts to format 1 data. If a new version of the format is added, then separate tests for that format should be added. 
""" def test_construct(self): """ContainerPushParser can be constructed.""" pack.ContainerPushParser() def test_multiple_records_at_once(self): """If multiple records worth of data are fed to the parser in one string, the parser will correctly parse all the records. (A naive implementation might stop after parsing the first record.) """ parser = self.make_parser_expecting_record_type() parser.accept_bytes(b"B5\nname1\n\nbody1B5\nname2\n\nbody2") self.assertEqual( [([(b"name1",)], b"body1"), ([(b"name2",)], b"body2")], parser.read_pending_records(), ) def test_multiple_empty_records_at_once(self): """If multiple empty records worth of data are fed to the parser in one string, the parser will correctly parse all the records. (A naive implementation might stop after parsing the first empty record, because the buffer size had not changed.) """ parser = self.make_parser_expecting_record_type() parser.accept_bytes(b"B0\nname1\n\nB0\nname2\n\n") self.assertEqual( [([(b"name1",)], b""), ([(b"name2",)], b"")], parser.read_pending_records() ) class TestContainerPushParserBytesParsing(PushParserTestCase): """Tests for reading Bytes records with ContainerPushParser. The ContainerPushParser reads format 1 containers, so these tests explicitly test how it reacts to format 1 data. If a new version of the format is added, then separate tests for that format should be added. """ def test_record_with_no_name(self): """Reading a Bytes record with no name returns an empty list of names. """ self.assertRecordParsing(([], b"aaaaa"), b"5\n\naaaaa") def test_record_with_one_name(self): """Reading a Bytes record with one name returns a list of just that name. 
""" self.assertRecordParsing(([(b"name1",)], b"aaaaa"), b"5\nname1\n\naaaaa") def test_record_with_two_names(self): """Reading a Bytes record with two names returns a list of both names.""" self.assertRecordParsing( ([(b"name1",), (b"name2",)], b"aaaaa"), b"5\nname1\nname2\n\naaaaa" ) def test_record_with_two_part_names(self): """Reading a Bytes record with a two_part name reads both.""" self.assertRecordParsing( ([(b"name1", b"name2")], b"aaaaa"), b"5\nname1\x00name2\n\naaaaa" ) def test_invalid_length(self): """If the length-prefix is not a number, parsing raises InvalidRecordError. """ parser = self.make_parser_expecting_bytes_record() self.assertRaises( pack.InvalidRecordError, parser.accept_bytes, b"not a number\n" ) def test_incomplete_record(self): """If the bytes seen so far don't form a complete record, then there will be nothing returned by read_pending_records. """ parser = self.make_parser_expecting_bytes_record() parser.accept_bytes(b"5\n\nabcd") self.assertEqual([], parser.read_pending_records()) def test_accept_nothing(self): """The edge case of parsing an empty string causes no error.""" parser = self.make_parser_expecting_bytes_record() parser.accept_bytes(b"") def assertInvalidRecord(self, data): """Assert that parsing the given bytes raises InvalidRecordError.""" parser = self.make_parser_expecting_bytes_record() self.assertRaises(pack.InvalidRecordError, parser.accept_bytes, data) def test_read_invalid_name_whitespace(self): """Names must have no whitespace.""" # A name with a space. self.assertInvalidRecord(b"0\nbad name\n\n") # A name with a tab. self.assertInvalidRecord(b"0\nbad\tname\n\n") # A name with a vertical tab. 
self.assertInvalidRecord(b"0\nbad\vname\n\n") def test_repeated_read_pending_records(self): """read_pending_records will not return the same record twice.""" parser = self.make_parser_expecting_bytes_record() parser.accept_bytes(b"6\n\nabcdef") self.assertEqual([([], b"abcdef")], parser.read_pending_records()) self.assertEqual([], parser.read_pending_records()) class TestErrors(TestCase): def test_unknown_container_format(self): """Test the formatting of UnknownContainerFormatError.""" e = pack.UnknownContainerFormatError("bad format string") self.assertEqual("Unrecognised container format: 'bad format string'", str(e)) def test_unexpected_end_of_container(self): """Test the formatting of UnexpectedEndOfContainerError.""" e = pack.UnexpectedEndOfContainerError() self.assertEqual("Unexpected end of container stream", str(e)) def test_unknown_record_type(self): """Test the formatting of UnknownRecordTypeError.""" e = pack.UnknownRecordTypeError("X") self.assertEqual("Unknown record type: 'X'", str(e)) def test_invalid_record(self): """Test the formatting of InvalidRecordError.""" e = pack.InvalidRecordError("xxx") self.assertEqual("Invalid record: xxx", str(e)) def test_container_has_excess_data(self): """Test the formatting of ContainerHasExcessDataError.""" e = pack.ContainerHasExcessDataError("excess bytes") self.assertEqual("Container has data after end marker: 'excess bytes'", str(e)) def test_duplicate_record_name_error(self): """Test the formatting of DuplicateRecordNameError.""" e = pack.DuplicateRecordNameError(b"n\xc3\xa5me") self.assertEqual( "Container has multiple records with the same name: n\xe5me", str(e) ) bzrformats_3.4.0.orig/bzrformats/tests/test_registry.py0000644000000000000000000003532715162115103020477 0ustar00# Copyright (C) 2006, 2008-2012, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either 
version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for the Registry classes.""" import os import sys from .. import osutils, registry from . import TestCase, TestCaseInTempDir class TestRegistry(TestCase): def register_stuff(self, a_registry): a_registry.register("one", 1) a_registry.register("two", 2) a_registry.register("four", 4) a_registry.register("five", 5) def test_registry(self): a_registry = registry.Registry() self.register_stuff(a_registry) self.assertIsNone(a_registry.default_key) # test get() (self.default_key is None) self.assertRaises(KeyError, a_registry.get) self.assertRaises(KeyError, a_registry.get, None) self.assertEqual(2, a_registry.get("two")) self.assertRaises(KeyError, a_registry.get, "three") # test _set_default_key a_registry.default_key = "five" self.assertEqual(a_registry.default_key, "five") self.assertEqual(5, a_registry.get()) self.assertEqual(5, a_registry.get(None)) # If they ask for a specific entry, they should get KeyError # not the default value. 
They can always pass None if they prefer self.assertRaises(KeyError, a_registry.get, "six") self.assertRaises(KeyError, a_registry._set_default_key, "six") # test keys() self.assertEqual(["five", "four", "one", "two"], a_registry.keys()) def test_registry_funcs(self): a_registry = registry.Registry() self.register_stuff(a_registry) self.assertIn("one", a_registry) a_registry.remove("one") self.assertNotIn("one", a_registry) self.assertRaises(KeyError, a_registry.get, "one") a_registry.register("one", "one") self.assertEqual(["five", "four", "one", "two"], sorted(a_registry.keys())) self.assertEqual( [("five", 5), ("four", 4), ("one", "one"), ("two", 2)], sorted(a_registry.iteritems()), ) def test_register_override(self): a_registry = registry.Registry() a_registry.register("one", "one") self.assertRaises(KeyError, a_registry.register, "one", "two") self.assertRaises( KeyError, a_registry.register, "one", "two", override_existing=False ) a_registry.register("one", "two", override_existing=True) self.assertEqual("two", a_registry.get("one")) self.assertRaises(KeyError, a_registry.register_lazy, "one", "three", "four") a_registry.register_lazy("one", "module", "member", override_existing=True) def test_registry_help(self): a_registry = registry.Registry() a_registry.register("one", 1, help="help text for one") # We should not have to import the module to return the help # information a_registry.register_lazy( "two", "nonexistent_module", "member", help="help text for two" ) # We should be able to handle a callable to get information help_calls = [] def generic_help(reg, key): help_calls.append(key) return f"generic help for {key}" a_registry.register("three", 3, help=generic_help) a_registry.register_lazy( "four", "nonexistent_module", "member2", help=generic_help ) a_registry.register("five", 5) def help_from_object(reg, key): obj = reg.get(key) return obj.help() class SimpleObj: def help(self): return "this is my help" a_registry.register("six", SimpleObj(), 
help=help_from_object) self.assertEqual("help text for one", a_registry.get_help("one")) self.assertEqual("help text for two", a_registry.get_help("two")) self.assertEqual("generic help for three", a_registry.get_help("three")) self.assertEqual(["three"], help_calls) self.assertEqual("generic help for four", a_registry.get_help("four")) self.assertEqual(["three", "four"], help_calls) self.assertEqual(None, a_registry.get_help("five")) self.assertEqual("this is my help", a_registry.get_help("six")) self.assertRaises(KeyError, a_registry.get_help, None) self.assertRaises(KeyError, a_registry.get_help, "seven") a_registry.default_key = "one" self.assertEqual("help text for one", a_registry.get_help(None)) self.assertRaises(KeyError, a_registry.get_help, "seven") self.assertEqual( [ ("five", None), ("four", "generic help for four"), ("one", "help text for one"), ("six", "this is my help"), ("three", "generic help for three"), ("two", "help text for two"), ], sorted((key, a_registry.get_help(key)) for key in a_registry.keys()), ) # We don't know what order it was called in, but we should get # 2 more calls to three and four self.assertEqual(["four", "four", "three", "three"], sorted(help_calls)) def test_registry_info(self): a_registry = registry.Registry() a_registry.register("one", 1, info="string info") # We should not have to import the module to return the info a_registry.register_lazy("two", "nonexistent_module", "member", info=2) # We should be able to handle a callable to get information a_registry.register("three", 3, info=["a", "list"]) obj = object() a_registry.register_lazy("four", "nonexistent_module", "member2", info=obj) a_registry.register("five", 5) self.assertEqual("string info", a_registry.get_info("one")) self.assertEqual(2, a_registry.get_info("two")) self.assertEqual(["a", "list"], a_registry.get_info("three")) self.assertIs(obj, a_registry.get_info("four")) self.assertIs(None, a_registry.get_info("five")) self.assertRaises(KeyError, 
a_registry.get_info, None) self.assertRaises(KeyError, a_registry.get_info, "six") a_registry.default_key = "one" self.assertEqual("string info", a_registry.get_info(None)) self.assertRaises(KeyError, a_registry.get_info, "six") self.assertEqual( [ ("five", None), ("four", obj), ("one", "string info"), ("three", ["a", "list"]), ("two", 2), ], sorted((key, a_registry.get_info(key)) for key in a_registry.keys()), ) def test_get_prefix(self): my_registry = registry.Registry() http_object = object() sftp_object = object() my_registry.register("http:", http_object) my_registry.register("sftp:", sftp_object) found_object, suffix = my_registry.get_prefix("http://foo/bar") self.assertEqual("//foo/bar", suffix) self.assertIs(http_object, found_object) self.assertIsNot(sftp_object, found_object) found_object, suffix = my_registry.get_prefix("sftp://baz/qux") self.assertEqual("//baz/qux", suffix) self.assertIs(sftp_object, found_object) def test_registry_alias(self): a_registry = registry.Registry() a_registry.register("one", 1, info="string info") a_registry.register_alias("two", "one") a_registry.register_alias("three", "one", info="own info") self.assertEqual(a_registry.get("one"), a_registry.get("two")) self.assertEqual(a_registry.get_help("one"), a_registry.get_help("two")) self.assertEqual(a_registry.get_info("one"), a_registry.get_info("two")) self.assertEqual("own info", a_registry.get_info("three")) self.assertEqual({"two": "one", "three": "one"}, a_registry.aliases()) self.assertEqual( {"one": ["three", "two"]}, {k: sorted(v) for (k, v) in a_registry.alias_map().items()}, ) def test_registry_alias_exists(self): a_registry = registry.Registry() a_registry.register("one", 1, info="string info") a_registry.register("two", 2) self.assertRaises(KeyError, a_registry.register_alias, "one", "one") def test_registry_alias_targetmissing(self): a_registry = registry.Registry() self.assertRaises(KeyError, a_registry.register_alias, "one", "two") class 
TestRegistryIter(TestCase):
    """Test registry iteration behaviors.

    There are dark corner cases here when the registered objects trigger
    addition in the iterated registry.
    """

    def setUp(self):
        super().setUp()

        # We create a registry with "official" objects and "hidden"
        # objects. The latter represent the side effects that led to bug #277048
        # and #430510
        _registry = registry.Registry()

        def register_more():
            _registry.register("hidden", None)

        # Avoid closing over self by binding local variable
        self.registry = _registry
        self.registry.register("passive", None)
        self.registry.register("active", register_more)
        self.registry.register("passive-too", None)

        class InvasiveGetter(registry._ObjectGetter):
            def get_obj(inner_self):  # noqa: N805
                # Surprise ! Getting a registered object (think lazy loaded
                # module) registers yet another object !
                _registry.register("more hidden", None)
                return inner_self._obj

        self.registry.register("hacky", None)
        # We peek under the covers because the alternative is to use lazy
        # registration and create a module that can reference our test
        # registry; it's too much work for such a corner case -- vila 090916
        self.registry._dict["hacky"] = InvasiveGetter(None)

    def _iter_them(self, iter_func_name):
        iter_func = getattr(self.registry, iter_func_name, None)
        self.assertIsNot(None, iter_func)
        count = 0
        for name, func in iter_func():
            count += 1
            self.assertNotIn(name, ("hidden", "more hidden"))
            if func is not None:
                # Using an object registers another one as a side effect
                func()
        self.assertEqual(4, count)

    def test_iteritems(self):
        # the dict is modified during the iteration
        self.assertRaises(RuntimeError, self._iter_them, "iteritems")

    def test_items(self):
        # we should be able to iterate even if one item modifies the dict
        self._iter_them("items")


class TestRegistryWithDirs(TestCaseInTempDir):
    """Registry tests that require temporary dirs."""

    def create_plugin_file(self, contents):
        """Create a file to be used as a plugin.
This is created in a temporary directory, so that we are sure that it doesn't start in the plugin path. """ os.mkdir("tmp") plugin_name = f"bzr_plugin_a_{osutils.rand_chars(4)}" with open("tmp/" + plugin_name + ".py", "wb") as f: f.write(contents) return plugin_name def create_simple_plugin(self): return self.create_plugin_file( b'object1 = "foo"\n' b"\n\n" b"def function(a,b,c):\n" b" return a,b,c\n" b"\n\n" b"class MyClass(object):\n" b" def __init__(self, a):\n" b" self.a = a\n" b"\n\n" ) def test_lazy_import_registry_foo(self): a_registry = registry.Registry() a_registry.register_lazy("foo", "bzrformats.revision", "Revision") a_registry.register_lazy("bar", "bzrformats.revision", "NULL_REVISION") from bzrformats.revision import NULL_REVISION, Revision self.assertEqual(Revision, a_registry.get("foo")) self.assertEqual(NULL_REVISION, a_registry.get("bar")) def test_lazy_import_registry(self): plugin_name = self.create_simple_plugin() a_registry = registry.Registry() a_registry.register_lazy("obj", plugin_name, "object1") a_registry.register_lazy("function", plugin_name, "function") a_registry.register_lazy("klass", plugin_name, "MyClass") a_registry.register_lazy("module", plugin_name, None) self.assertEqual( ["function", "klass", "module", "obj"], sorted(a_registry.keys()) ) # The plugin should not be loaded until we grab the first object self.assertNotIn(plugin_name, sys.modules) # By default the plugin won't be in the search path self.assertRaises(ImportError, a_registry.get, "obj") plugin_path = self.test_dir + "/tmp" # noqa: S108 sys.path.append(plugin_path) try: obj = a_registry.get("obj") self.assertEqual("foo", obj) self.assertIn(plugin_name, sys.modules) # Now grab another object func = a_registry.get("function") self.assertEqual(plugin_name, func.__module__) self.assertEqual("function", func.__name__) self.assertEqual((1, [], "3"), func(1, [], "3")) # And finally a class klass = a_registry.get("klass") self.assertEqual(plugin_name, klass.__module__) 
self.assertEqual("MyClass", klass.__name__) inst = klass(1) self.assertIsInstance(inst, klass) self.assertEqual(1, inst.a) module = a_registry.get("module") self.assertIs(obj, module.object1) self.assertIs(func, module.function) self.assertIs(klass, module.MyClass) finally: sys.path.remove(plugin_path) def test_lazy_import_get_module(self): a_registry = registry.Registry() a_registry.register_lazy("obj", "bzrformats.tests.test_registry", "object1") self.assertEqual( "bzrformats.tests.test_registry", a_registry._get_module("obj") ) def test_normal_get_module(self): class AThing: """Something.""" a_registry = registry.Registry() a_registry.register("obj", AThing()) self.assertEqual( "bzrformats.tests.test_registry", a_registry._get_module("obj") ) bzrformats_3.4.0.orig/bzrformats/tests/test_revision.py0000644000000000000000000000642015162115103020455 0ustar00# Copyright (C) 2005-2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA from ..revision import Revision from . 
import TestCase class TestRevisionMethods(TestCase): def test_get_summary(self): r = Revision( b"1", parent_ids=[], committer="", message="a", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual("a", r.get_summary()) r = Revision( b"1", parent_ids=[], committer="", message="a\nb", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual("a", r.get_summary()) r = Revision( b"1", parent_ids=[], committer="", message="\na\nb", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual("a", r.get_summary()) r = Revision( b"1", parent_ids=[], committer="", message="", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual("", r.get_summary()) def test_get_apparent_authors(self): r = Revision( b"1", parent_ids=[], committer="A", message="", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual(["A"], r.get_apparent_authors()) r = Revision( b"1", parent_ids=[], committer="A", message="", timestamp=0, timezone=0, inventory_sha1=None, properties={"author": "B"}, ) self.assertEqual(["B"], r.get_apparent_authors()) r = Revision( b"1", parent_ids=[], committer="A", message="", timestamp=0, timezone=0, inventory_sha1=None, properties={"author": "B", "authors": "C\nD"}, ) self.assertEqual(["C", "D"], r.get_apparent_authors()) def test_get_apparent_authors_no_committer(self): r = Revision( b"1", parent_ids=[], committer="", message="", timestamp=0, timezone=0, inventory_sha1=None, properties={}, ) self.assertEqual([], r.get_apparent_authors()) bzrformats_3.4.0.orig/bzrformats/tests/test_rio.py0000644000000000000000000003425115162115103017413 0ustar00# Copyright (C) 2005, 2006, 2007, 2009, 2010, 2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later 
version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for rio serialization. A simple, reproducible structured IO format. rio itself works in Unicode strings. It is typically encoded to UTF-8, but this depends on the transport. """ import re from tempfile import TemporaryFile from .. import rio as _mod_rio from ..osutils import IterableFile from ..rio_patch import read_patch_stanza, to_patch_lines from . import TestCase def rio_file(stanzas): return IterableFile(_mod_rio.rio_iter(stanzas)) class TestRio(TestCase): def test_stanza(self): """Construct rio stanza in memory.""" s = _mod_rio.Stanza(number="42", name="fred") self.assertIn("number", s) self.assertNotIn("color", s) self.assertNotIn("42", s) self.assertEqual(list(s.iter_pairs()), [("name", "fred"), ("number", "42")]) self.assertEqual(s.get("number"), "42") self.assertEqual(s.get("name"), "fred") def test_empty_value(self): """Serialize stanza with empty field.""" s = _mod_rio.Stanza(empty="") self.assertEqual(s.to_string(), b"empty: \n") def test_to_lines(self): """Write simple rio stanza to string.""" s = _mod_rio.Stanza(number="42", name="fred") self.assertEqual(list(s.to_lines()), [b"name: fred\n", b"number: 42\n"]) def test_as_dict(self): """Convert rio Stanza to dictionary.""" s = _mod_rio.Stanza(number="42", name="fred") sd = s.as_dict() self.assertEqual(sd, {"number": "42", "name": "fred"}) def test_to_file(self): """Write rio to file.""" tmpf = TemporaryFile() s = _mod_rio.Stanza( a_thing='something with "quotes like \\"this\\""', number="42", name="fred" ) s.write(tmpf) tmpf.seek(0) 
self.assertEqual( tmpf.read(), b"""\ a_thing: something with "quotes like \\"this\\"" name: fred number: 42 """, ) def test_multiline_string(self): tmpf = TemporaryFile() s = _mod_rio.Stanza( motto="war is peace\nfreedom is slavery\nignorance is strength" ) s.write(tmpf) tmpf.seek(0) self.assertEqual( tmpf.read(), b"""\ motto: war is peace \tfreedom is slavery \tignorance is strength """, ) tmpf.seek(0) s2 = _mod_rio.read_stanza(tmpf) self.assertEqual(s, s2) def test_read_stanza(self): """Load stanza from string.""" lines = b"""\ revision: mbp@sourcefrog.net-123-abc timestamp: 1130653962 timezone: 36000 committer: Martin Pool """.splitlines(True) s = _mod_rio.read_stanza(lines) self.assertIn("revision", s) self.assertEqual(s.get("revision"), "mbp@sourcefrog.net-123-abc") self.assertEqual( list(s.iter_pairs()), [ ("revision", "mbp@sourcefrog.net-123-abc"), ("timestamp", "1130653962"), ("timezone", "36000"), ("committer", "Martin Pool "), ], ) self.assertEqual(len(s), 4) def test_repeated_field(self): """Repeated field in rio.""" s = _mod_rio.Stanza() for k, v in [ ("a", "10"), ("b", "20"), ("a", "100"), ("b", "200"), ("a", "1000"), ("b", "2000"), ]: s.add(k, v) s2 = _mod_rio.read_stanza(s.to_lines()) self.assertEqual(s, s2) self.assertEqual(s.get_all("a"), ["10", "100", "1000"]) self.assertEqual(s.get_all("b"), ["20", "200", "2000"]) def test_backslash(self): s = _mod_rio.Stanza(q="\\") t = s.to_string() self.assertEqual(t, b"q: \\\n") s2 = _mod_rio.read_stanza(s.to_lines()) self.assertEqual(s, s2) def test_blank_line(self): s = _mod_rio.Stanza(none="", one="\n", two="\n\n") self.assertEqual( s.to_string(), b"""\ none:\x20 one:\x20 \t two:\x20 \t \t """, ) s2 = _mod_rio.read_stanza(s.to_lines()) self.assertEqual(s, s2) def test_whitespace_value(self): s = _mod_rio.Stanza(space=" ", tabs="\t\t\t", combo="\n\t\t\n") self.assertEqual( s.to_string(), b"""\ combo:\x20 \t\t\t \t space:\x20\x20 tabs: \t\t\t """, ) s2 = _mod_rio.read_stanza(s.to_lines()) self.assertEqual(s, 
s2) self.rio_file_stanzas([s]) def test_quoted(self): """Rio quoted string cases.""" s = _mod_rio.Stanza( q1='"hello"', q2=' "for', q3='\n\n"for"\n', q4='for\n"\nfor', q5="\n", q6='"', q7='""', q8="\\", q9='\\"\\"', ) s2 = _mod_rio.read_stanza(s.to_lines()) self.assertEqual(s, s2) # apparent bug in read_stanza # s3 = _mod_rio.read_stanza(self.stanzas_to_str([s])) # self.assertEqual(s, s3) def test_read_empty(self): """Detect end of rio file.""" s = _mod_rio.read_stanza([]) self.assertEqual(s, None) self.assertIsNone(s) def test_read_nul_byte(self): """File consisting of a nul byte causes an error.""" self.assertRaises(ValueError, _mod_rio.read_stanza, [b"\0"]) def test_read_nul_bytes(self): """File consisting of many nul bytes causes an error.""" self.assertRaises(ValueError, _mod_rio.read_stanza, [b"\0" * 100]) def test_read_iter(self): """Read several stanzas from file.""" tmpf = TemporaryFile() tmpf.write( b"""\ version_header: 1 name: foo val: 123 name: bar val: 129319 """ ) tmpf.seek(0) reader = _mod_rio.read_stanzas(tmpf) stuff = list(reader) self.assertEqual( stuff, [ _mod_rio.Stanza(version_header="1"), _mod_rio.Stanza(name="foo", val="123"), _mod_rio.Stanza(name="bar", val="129319"), ], ) def test_read_several(self): """Read several stanzas from file.""" tmpf = TemporaryFile() tmpf.write( b"""\ version_header: 1 name: foo val: 123 name: quoted address: "Willowglen" \t 42 Wallaby Way \t Sydney name: bar val: 129319 """ ) tmpf.seek(0) s = _mod_rio.read_stanza(tmpf) self.assertEqual(s, _mod_rio.Stanza(version_header="1")) s = _mod_rio.read_stanza(tmpf) self.assertEqual(s, _mod_rio.Stanza(name="foo", val="123")) s = _mod_rio.read_stanza(tmpf) self.assertEqual(s.get("name"), "quoted") self.assertEqual(s.get("address"), ' "Willowglen"\n 42 Wallaby Way\n Sydney') s = _mod_rio.read_stanza(tmpf) self.assertEqual(s, _mod_rio.Stanza(name="bar", val="129319")) s = _mod_rio.read_stanza(tmpf) self.assertEqual(s, None) def check_rio_file(self, real_file): 
real_file.seek(0) read_write = rio_file(_mod_rio.RioReader(real_file)).read() real_file.seek(0) self.assertEqual(read_write, real_file.read()) @staticmethod def stanzas_to_str(stanzas): return rio_file(stanzas).read() def rio_file_stanzas(self, stanzas): new_stanzas = list(_mod_rio.RioReader(rio_file(stanzas))) self.assertEqual(new_stanzas, stanzas) def test_tricky_quoted(self): tmpf = TemporaryFile() tmpf.write( b'''\ s: "one" s:\x20 \t"one" \t s: " s: "" s: """ s:\x20 \t s: \\ s:\x20 \t\\ \t\\\\ \t s: word\\ s: quote" s: backslashes\\\\\\ s: both\\\" ''' ) tmpf.seek(0) expected_vals = [ '"one"', '\n"one"\n', '"', '""', '"""', "\n", "\\", "\n\\\n\\\\\n", "word\\", 'quote"', "backslashes\\\\\\", 'both\\"', ] for expected in expected_vals: stanza = _mod_rio.read_stanza(tmpf) self.rio_file_stanzas([stanza]) self.assertEqual(len(stanza), 1) self.assertEqual(stanza.get("s"), expected) def test_write_empty_stanza(self): """Write empty stanza.""" l = list(_mod_rio.Stanza().to_lines()) self.assertEqual(l, []) def test_rio_raises_type_error(self): """TypeError on adding invalid type to Stanza.""" s = _mod_rio.Stanza() self.assertRaises(TypeError, s.add, "foo", {}) def test_rio_raises_type_error_key(self): """TypeError on adding invalid type to Stanza.""" s = _mod_rio.Stanza() self.assertRaises(TypeError, s.add, 10, {}) def test_rio_surrogateescape(self): raw_bytes = b"\xcb" self.assertRaises(UnicodeDecodeError, raw_bytes.decode, "utf-8") try: uni_data = raw_bytes.decode("utf-8", "surrogateescape") except LookupError: self.skipTest("surrogateescape is not available on Python < 3") try: _mod_rio.Stanza(foo=uni_data) except TypeError: pass else: self.fail() def test_rio_unicode(self): uni_data = "\N{KATAKANA LETTER O}" s = _mod_rio.Stanza(foo=uni_data) self.assertEqual(s.get("foo"), uni_data) raw_lines = s.to_lines() self.assertEqual(raw_lines, [b"foo: " + uni_data.encode("utf-8") + b"\n"]) new_s = _mod_rio.read_stanza(raw_lines) self.assertEqual(new_s.get("foo"), uni_data) 
def mail_munge(self, lines, dos_nl=True): new_lines = [] for line in lines: line = re.sub(b" *\n", b"\n", line) if dos_nl: line = re.sub(b"([^\r])\n", b"\\1\r\n", line) new_lines.append(line) return new_lines def test_patch_rio(self): stanza = _mod_rio.Stanza(data="#\n\r\\r ", space=" " * 255, hash="#" * 255) lines = to_patch_lines(stanza) for line in lines: self.assertContainsRe(line, b"^# ") self.assertGreaterEqual(72, len(line)) for line in to_patch_lines(stanza, max_width=12): self.assertGreaterEqual(12, len(line)) new_stanza = read_patch_stanza(self.mail_munge(lines, dos_nl=False)) lines = self.mail_munge(lines) new_stanza = read_patch_stanza(lines) self.assertEqual("#\n\r\\r ", new_stanza.get("data")) self.assertEqual(" " * 255, new_stanza.get("space")) self.assertEqual("#" * 255, new_stanza.get("hash")) def test_patch_rio_linebreaks(self): stanza = _mod_rio.Stanza(breaktest="linebreak -/" * 30) line1 = to_patch_lines(stanza, 71)[0] self.assertContainsRe(line1, b"linebreak\\\\\n") stanza = _mod_rio.Stanza(breaktest="linebreak-/" * 30) self.assertContainsRe(to_patch_lines(stanza, 70)[0], b"linebreak-\\\\\n") stanza = _mod_rio.Stanza(breaktest="linebreak/" * 30) self.assertContainsRe(to_patch_lines(stanza, 70)[0], b"linebreak\\\\\n") class TestValidTag(TestCase): def test_ok(self): self.assertTrue(_mod_rio.valid_tag("foo")) def test_no_spaces(self): self.assertFalse(_mod_rio.valid_tag("foo bla")) def test_numeric(self): self.assertTrue(_mod_rio.valid_tag("3foo423")) def test_no_colon(self): self.assertFalse(_mod_rio.valid_tag("foo:bla")) def test_type_error(self): self.assertRaises(TypeError, _mod_rio.valid_tag, 423) def test_empty(self): self.assertFalse(_mod_rio.valid_tag("")) def test_unicode(self): # When str is a unicode type, it is valid for a tag self.assertTrue(_mod_rio.valid_tag("foo")) def test_non_ascii_char(self): self.assertFalse(_mod_rio.valid_tag("\xb5")) class TestReadUTF8Stanza(TestCase): def assertReadStanza(self, result, line_iter): s = 
_mod_rio.read_stanza(line_iter) self.assertEqual(result, s) if s is not None: for tag, value in s.iter_pairs(): self.assertIsInstance(tag, str) self.assertIsInstance(value, str) def assertReadStanzaRaises(self, exception, line_iter): self.assertRaises(exception, _mod_rio.read_stanza, line_iter) def test_no_string(self): self.assertReadStanzaRaises(TypeError, [21323]) def test_empty(self): self.assertReadStanza(None, []) def test_none(self): self.assertReadStanza(None, [b""]) def test_simple(self): self.assertReadStanza(_mod_rio.Stanza(foo="bar"), [b"foo: bar\n", b""]) def test_multi_line(self): self.assertReadStanza( _mod_rio.Stanza(foo="bar\nbla"), [b"foo: bar\n", b"\tbla\n"] ) def test_repeated(self): s = _mod_rio.Stanza() s.add("foo", "bar") s.add("foo", "foo") self.assertReadStanza(s, [b"foo: bar\n", b"foo: foo\n"]) def test_invalid_early_colon(self): self.assertReadStanzaRaises(ValueError, [b"f:oo: bar\n"]) def test_invalid_tag(self): self.assertReadStanzaRaises(ValueError, [b"f%oo: bar\n"]) def test_continuation_too_early(self): self.assertReadStanzaRaises(ValueError, [b"\tbar\n"]) def test_large(self): value = b"bla" * 9000 self.assertReadStanza( _mod_rio.Stanza(foo=value.decode()), [b"foo: %s\n" % value] ) def test_non_ascii_char(self): self.assertReadStanza( _mod_rio.Stanza(foo="n\xe5me"), ["foo: n\xe5me\n".encode()] ) bzrformats_3.4.0.orig/bzrformats/tests/test_serializer.py0000644000000000000000000000314515162115103020771 0ustar00# Copyright (C) 2005 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for the revision/inventory Serializers.""" from bzrformats import chk_serializer, xml5, xml6, xml7, xml8 from .. import serializer from . import TestCase class TestSerializer(TestCase): """Test serializer.""" def test_registry(self): self.assertIs( xml5.revision_serializer_v5, serializer.revision_format_registry.get("5") ) self.assertIs( xml8.revision_serializer_v8, serializer.revision_format_registry.get("8") ) self.assertIs( xml6.inventory_serializer_v6, serializer.inventory_format_registry.get("6") ) self.assertIs( xml7.inventory_serializer_v7, serializer.inventory_format_registry.get("7") ) self.assertIs( chk_serializer.inventory_chk_serializer_255_bigpage_9, serializer.inventory_format_registry.get("9"), ) bzrformats_3.4.0.orig/bzrformats/tests/test_textmerge.py0000644000000000000000000000420415162115103020621 0ustar00# Copyright (C) 2006 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # Author: Aaron Bentley """Tests for text merging functionality.""" from ..textmerge import Merge2 from . 
import TestCase class TestMerge2(TestCase): """Test the Merge2 text merging class.""" def test_agreed(self): """Test merging identical text produces the same result.""" lines = "a\nb\nc\nd\ne\nf\n".splitlines(True) mlines = list(Merge2(lines, lines).merge_lines()[0]) self.assertEqualDiff(mlines, lines) def test_conflict(self): """Test merging conflicting text produces appropriate conflict markers.""" lines_a = "a\nb\nc\nd\ne\nf\ng\nh\n".splitlines(True) lines_b = "z\nb\nx\nd\ne\ne\nf\ng\ny\n".splitlines(True) expected = ( "<\na\n=\nz\n>\nb\n<\nc\n=\nx\n>\nd\ne\n<\n=\ne\n>\nf\ng\n<\nh\n=\ny\n>\n" ) m2 = Merge2(lines_a, lines_b, "<\n", ">\n", "=\n") mlines = m2.merge_lines()[0] self.assertEqualDiff("".join(mlines), expected) mlines = m2.merge_lines(reprocess=True)[0] self.assertEqualDiff("".join(mlines), expected) def test_reprocess(self): """Test the reprocess_struct method for conflict resolution.""" struct = [("a", "b"), ("c",), ("def", "geh"), ("i",)] expect = [("a", "b"), ("c",), ("d", "g"), ("e",), ("f", "h"), ("i",)] result = Merge2.reprocess_struct(struct) self.assertEqual(list(result), expect) bzrformats_3.4.0.orig/bzrformats/tests/test_tuned_gzip.py0000644000000000000000000000351015162115103020764 0ustar00# Copyright (C) 2006, 2009, 2010, 2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for tuned_gzip.""" import gzip from io import BytesIO from bzrformats import tuned_gzip from . import TestCase class TestToGzip(TestCase): def assertToGzip(self, chunks): raw_bytes = b"".join(chunks) gzfromchunks = tuned_gzip.chunks_to_gzip(chunks) decoded = gzip.GzipFile(fileobj=BytesIO(b"".join(gzfromchunks))).read() lraw, ldecoded = len(raw_bytes), len(decoded) self.assertEqual( lraw, ldecoded, "Expecting data length %d, got %d" % (lraw, ldecoded) ) self.assertEqual(raw_bytes, decoded) def test_single_chunk(self): self.assertToGzip([b"a modest chunk\nwith some various\nbits\n"]) def test_simple_text(self): self.assertToGzip([b"some\n", b"strings\n", b"to\n", b"process\n"]) def test_large_chunks(self): self.assertToGzip([b"a large string\n" * 1024]) self.assertToGzip([b"a large string\n"] * 1024) def test_enormous_chunks(self): self.assertToGzip([b"a large string\n" * 1024 * 256]) self.assertToGzip([b"a large string\n"] * 1024 * 256) bzrformats_3.4.0.orig/bzrformats/tests/test_versionedfile.py0000644000000000000000000001464015162115103021460 0ustar00# Copyright (C) 2010 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for VersionedFile classes.""" from .. import errors, groupcompress, multiparent, versionedfile from . import TestCase, TestCaseWithMemoryTransport class Test_MPDiffGenerator(TestCaseWithMemoryTransport): # Should this be a per vf test? def make_vf(self): t = self.get_transport("") factory = groupcompress.make_pack_factory(True, True, 1) return factory(t) def make_three_vf(self): vf = self.make_vf() vf.add_lines((b"one",), (), [b"first\n"]) vf.add_lines((b"two",), [(b"one",)], [b"first\n", b"second\n"]) vf.add_lines( (b"three",), [(b"one",), (b"two",)], [b"first\n", b"second\n", b"third\n"] ) return vf def test_finds_parents(self): vf = self.make_three_vf() gen = versionedfile._MPDiffGenerator(vf, [(b"three",)]) needed_keys, refcount = gen._find_needed_keys() self.assertEqual( sorted([(b"one",), (b"two",), (b"three",)]), sorted(needed_keys) ) self.assertEqual({(b"one",): 1, (b"two",): 1}, refcount) def test_ignores_ghost_parents(self): # If a parent is a ghost, it is just ignored vf = self.make_vf() vf.add_lines((b"two",), [(b"one",)], [b"first\n", b"second\n"]) gen = versionedfile._MPDiffGenerator(vf, [(b"two",)]) needed_keys, refcount = gen._find_needed_keys() self.assertEqual(sorted([(b"two",)]), sorted(needed_keys)) # It is returned, but we don't really care as we won't extract it self.assertEqual({(b"one",): 1}, refcount) self.assertEqual([(b"one",)], sorted(gen.ghost_parents)) self.assertEqual([], sorted(gen.present_parents)) def test_raises_on_ghost_keys(self): # If the requested key is a ghost, then we have a problem vf = self.make_vf() gen = versionedfile._MPDiffGenerator(vf, [(b"one",)]) self.assertRaises(errors.RevisionNotPresent, gen._find_needed_keys) def test_refcount_multiple_children(self): vf = self.make_three_vf() gen = 
versionedfile._MPDiffGenerator(vf, [(b"two",), (b"three",)]) needed_keys, refcount = gen._find_needed_keys() self.assertEqual( sorted([(b"one",), (b"two",), (b"three",)]), sorted(needed_keys) ) self.assertEqual({(b"one",): 2, (b"two",): 1}, refcount) self.assertEqual([(b"one",)], sorted(gen.present_parents)) def test_process_contents(self): vf = self.make_three_vf() gen = versionedfile._MPDiffGenerator(vf, [(b"two",), (b"three",)]) gen._find_needed_keys() self.assertEqual( {(b"two",): ((b"one",),), (b"three",): ((b"one",), (b"two",))}, gen.parent_map, ) self.assertEqual({(b"one",): 2, (b"two",): 1}, gen.refcounts) self.assertEqual( sorted([(b"one",), (b"two",), (b"three",)]), sorted(gen.needed_keys) ) stream = vf.get_record_stream(gen.needed_keys, "topological", True) record = next(stream) self.assertEqual((b"one",), record.key) # one is not needed in the output, but it is needed by children. As # such, it should end up in the various caches gen._process_one_record(record.key, record.get_bytes_as("chunked")) # The chunks should be cached, the refcount untouched self.assertEqual({(b"one",)}, set(gen.chunks)) self.assertEqual({(b"one",): 2, (b"two",): 1}, gen.refcounts) self.assertEqual(set(), set(gen.diffs)) # Next we get 'two', which is something we output, but also needed for # three record = next(stream) self.assertEqual((b"two",), record.key) gen._process_one_record(record.key, record.get_bytes_as("chunked")) # Both are now cached, and the diff for two has been extracted, and # one's refcount has been updated. 
two has been removed from the # parent_map self.assertEqual({(b"one",), (b"two",)}, set(gen.chunks)) self.assertEqual({(b"one",): 1, (b"two",): 1}, gen.refcounts) self.assertEqual({(b"two",)}, set(gen.diffs)) self.assertEqual({(b"three",): ((b"one",), (b"two",))}, gen.parent_map) # Finally 'three', which allows us to remove all parents from the # caches record = next(stream) self.assertEqual((b"three",), record.key) gen._process_one_record(record.key, record.get_bytes_as("chunked")) # The diff for three has been extracted, and all of its parents have # been released from the caches self.assertEqual(set(), set(gen.chunks)) self.assertEqual({}, gen.refcounts) self.assertEqual({(b"two",), (b"three",)}, set(gen.diffs)) def test_compute_diffs(self): vf = self.make_three_vf() # The content is in the order requested, even if it isn't topological gen = versionedfile._MPDiffGenerator(vf, [(b"two",), (b"three",), (b"one",)]) diffs = gen.compute_diffs() expected_diffs = [ multiparent.MultiParent( [multiparent.ParentText(0, 0, 0, 1), multiparent.NewText([b"second\n"])] ), multiparent.MultiParent( [multiparent.ParentText(1, 0, 0, 2), multiparent.NewText([b"third\n"])] ), multiparent.MultiParent([multiparent.NewText([b"first\n"])]), ] self.assertEqual(expected_diffs, diffs) class ErrorTests(TestCase): def test_unavailable_representation(self): error = versionedfile.UnavailableRepresentation(("key",), "mpdiff", "fulltext") self.assertEqualDiff( "The encoding 'mpdiff' is not available for key " "('key',) which is encoded as 'fulltext'.", str(error), ) bzrformats_3.4.0.orig/bzrformats/tests/test_weave.py0000644000000000000000000005316215162115103017733 0ustar00# Copyright (C) 2005-2011, 2016 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version.
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # TODO: tests regarding version names # TODO: rbc 20050108 test that join does not leave an inconsistent weave # if it fails. """test suite for weave algorithm.""" from io import BytesIO from pprint import pformat from ..errors import ReservedId, RevisionAlreadyPresent, RevisionNotPresent from ..osutils import sha_string from ..weave import Weave, WeaveFormatError, WeaveInvalidChecksum from ..weavefile import read_weave, write_weave from . import TestCase, TestCaseInTempDir # texts for use in testing TEXT_0 = [b"Hello world"] TEXT_1 = [b"Hello world", b"A second line"] class TestBase(TestCase): def check_read_write(self, k): """Check the weave k can be written & re-read.""" from tempfile import TemporaryFile tf = TemporaryFile() write_weave(k, tf) tf.seek(0) k2 = read_weave(tf) if k != k2: tf.seek(0) self.log("serialized weave:") self.log(tf.read()) self.log("") self.log("parents: %s" % (k._parents == k2._parents)) self.log(f" {k._parents!r}") self.log(f" {k2._parents!r}") self.log("") self.fail("read/write check failed") class WeaveContains(TestBase): """Weave __contains__ operator.""" def runTest(self): k = Weave(get_scope=lambda: None) self.assertNotIn(b"foo", k) k.add_lines(b"foo", [], TEXT_1) self.assertIn(b"foo", k) class Easy(TestBase): def runTest(self): Weave() class AnnotateOne(TestBase): def runTest(self): k = Weave() k.add_lines(b"text0", [], TEXT_0) self.assertEqual(k.annotate(b"text0"), [(b"text0", TEXT_0[0])]) class InvalidAdd(TestBase): """Try to use invalid version number during add.""" def 
runTest(self): k = Weave() self.assertRaises( RevisionNotPresent, k.add_lines, b"text0", [b"69"], [b"new text!"] ) class RepeatedAdd(TestBase): """Add the same version twice; harmless.""" def test_duplicate_add(self): k = Weave() idx = k.add_lines(b"text0", [], TEXT_0) idx2 = k.add_lines(b"text0", [], TEXT_0) self.assertEqual(idx, idx2) class InvalidRepeatedAdd(TestBase): def runTest(self): k = Weave() k.add_lines(b"basis", [], TEXT_0) k.add_lines(b"text0", [], TEXT_0) self.assertRaises( RevisionAlreadyPresent, k.add_lines, b"text0", [], [b"not the same text"], ) self.assertRaises( RevisionAlreadyPresent, k.add_lines, b"text0", [b"basis"], # not the right parents TEXT_0, ) class InsertLines(TestBase): """Store a revision that adds one line to the original. Look at the annotations to make sure that the first line is matched and not stored repeatedly. """ def runTest(self): k = Weave() k.add_lines(b"text0", [], [b"line 1"]) k.add_lines(b"text1", [b"text0"], [b"line 1", b"line 2"]) self.assertEqual(k.annotate(b"text0"), [(b"text0", b"line 1")]) self.assertEqual(k.get_lines(1), [b"line 1", b"line 2"]) self.assertEqual( k.annotate(b"text1"), [(b"text0", b"line 1"), (b"text1", b"line 2")] ) k.add_lines(b"text2", [b"text0"], [b"line 1", b"diverged line"]) self.assertEqual( k.annotate(b"text2"), [(b"text0", b"line 1"), (b"text2", b"diverged line")] ) text3 = [b"line 1", b"middle line", b"line 2"] k.add_lines(b"text3", [b"text0", b"text1"], text3) # self.log("changes to text3: " + pformat(list(k._delta(set([0, 1]), # text3)))) self.log("k._weave=" + pformat(k._weave)) self.assertEqual( k.annotate(b"text3"), [(b"text0", b"line 1"), (b"text3", b"middle line"), (b"text1", b"line 2")], ) # now multiple insertions at different places k.add_lines( b"text4", [b"text0", b"text1", b"text3"], [b"line 1", b"aaa", b"middle line", b"bbb", b"line 2", b"ccc"], ) self.assertEqual( k.annotate(b"text4"), [ (b"text0", b"line 1"), (b"text4", b"aaa"), (b"text3", b"middle line"), (b"text4", 
b"bbb"), (b"text1", b"line 2"), (b"text4", b"ccc"), ], ) class DeleteLines(TestBase): """Deletion of lines from existing text. Try various texts all based on a common ancestor. """ def runTest(self): k = Weave() base_text = [b"one", b"two", b"three", b"four"] k.add_lines(b"text0", [], base_text) texts = [ [b"one", b"two", b"three"], [b"two", b"three", b"four"], [b"one", b"four"], [b"one", b"two", b"three", b"four"], ] i = 1 for t in texts: k.add_lines(b"text%d" % i, [b"text0"], t) i += 1 self.log("final weave:") self.log("k._weave=" + pformat(k._weave)) for i in range(len(texts)): self.assertEqual(k.get_lines(i + 1), texts[i]) class SuicideDelete(TestBase): """Invalid weave which tries to add and delete simultaneously.""" def runTest(self): k = Weave() k._parents = [ (), ] k._weave = [ (b"{", 0), b"first line", (b"[", 0), b"deleted in 0", (b"]", 0), (b"}", 0), ] # SKIPPED # Weave.get doesn't trap this anymore return self.assertRaises(WeaveFormatError, k.get_lines, 0) class CannedDelete(TestBase): """Unpack canned weave with deleted lines.""" def runTest(self): k = Weave() k._parents = [ (), frozenset([0]), ] k._weave = [ (b"{", 0), b"first line", (b"[", 1), b"line to be deleted", (b"]", 1), b"last line", (b"}", 0), ] k._sha1s = [ sha_string(b"first lineline to be deletedlast line"), sha_string(b"first linelast line"), ] self.assertEqual( k.get_lines(0), [ b"first line", b"line to be deleted", b"last line", ], ) self.assertEqual( k.get_lines(1), [ b"first line", b"last line", ], ) class CannedReplacement(TestBase): """Unpack canned weave with deleted lines.""" def runTest(self): k = Weave() k._parents = [ frozenset(), frozenset([0]), ] k._weave = [ (b"{", 0), b"first line", (b"[", 1), b"line to be deleted", (b"]", 1), (b"{", 1), b"replacement line", (b"}", 1), b"last line", (b"}", 0), ] k._sha1s = [ sha_string(b"first lineline to be deletedlast line"), sha_string(b"first linereplacement linelast line"), ] self.assertEqual( k.get_lines(0), [ b"first line", b"line to 
be deleted", b"last line", ], ) self.assertEqual( k.get_lines(1), [ b"first line", b"replacement line", b"last line", ], ) class BadWeave(TestBase): """Test that we trap an insert which should not occur.""" def runTest(self): k = Weave() k._parents = [ frozenset(), ] k._weave = [ b"bad line", (b"{", 0), b"foo {", (b"{", 1), b" added in version 1", (b"{", 2), b" added in v2", (b"}", 2), b" also from v1", (b"}", 1), b"}", (b"}", 0), ] # SKIPPED # Weave.get doesn't trap this anymore return self.assertRaises(WeaveFormatError, k.get, 0) class BadInsert(TestBase): """Test that we trap an insert which should not occur.""" def runTest(self): k = Weave() k._parents = [ frozenset(), frozenset([0]), frozenset([0]), frozenset([0, 1, 2]), ] k._weave = [ (b"{", 0), b"foo {", (b"{", 1), b" added in version 1", (b"{", 1), b" more in 1", (b"}", 1), (b"}", 1), (b"}", 0), ] # this is not currently enforced by get return self.assertRaises(WeaveFormatError, k.get, 0) self.assertRaises(WeaveFormatError, k.get, 1) class InsertNested(TestBase): """Insertion with nested instructions.""" def runTest(self): k = Weave() k._parents = [ frozenset(), frozenset([0]), frozenset([0]), frozenset([0, 1, 2]), ] k._weave = [ (b"{", 0), b"foo {", (b"{", 1), b" added in version 1", (b"{", 2), b" added in v2", (b"}", 2), b" also from v1", (b"}", 1), b"}", (b"}", 0), ] k._sha1s = [ sha_string(b"foo {}"), sha_string(b"foo { added in version 1 also from v1}"), sha_string(b"foo { added in v2}"), sha_string(b"foo { added in version 1 added in v2 also from v1}"), ] self.assertEqual(k.get_lines(0), [b"foo {", b"}"]) self.assertEqual( k.get_lines(1), [b"foo {", b" added in version 1", b" also from v1", b"}"] ) self.assertEqual(k.get_lines(2), [b"foo {", b" added in v2", b"}"]) self.assertEqual( k.get_lines(3), [ b"foo {", b" added in version 1", b" added in v2", b" also from v1", b"}", ], ) class DeleteLines2(TestBase): """Test recording revisions that delete lines. 
This relies on the weave having a way to represent lines knocked out by a later revision. """ def runTest(self): k = Weave() k.add_lines(b"text0", [], [b"line the first", b"line 2", b"line 3", b"fine"]) self.assertEqual(len(k.get_lines(0)), 4) k.add_lines(b"text1", [b"text0"], [b"line the first", b"fine"]) self.assertEqual(k.get_lines(1), [b"line the first", b"fine"]) self.assertEqual( k.annotate(b"text1"), [(b"text0", b"line the first"), (b"text0", b"fine")] ) class IncludeVersions(TestBase): """Check texts that are stored across multiple revisions. Here we manually create a weave with a particular encoding and make sure it unpacks properly. Text 0 includes nothing; text 1 includes text 0 and adds some lines. """ def runTest(self): k = Weave() k._parents = [frozenset(), frozenset([0])] k._weave = [ (b"{", 0), b"first line", (b"}", 0), (b"{", 1), b"second line", (b"}", 1), ] k._sha1s = [sha_string(b"first line"), sha_string(b"first linesecond line")] self.assertEqual(k.get_lines(1), [b"first line", b"second line"]) self.assertEqual(k.get_lines(0), [b"first line"]) class DivergedIncludes(TestBase): """Weave with two diverged texts based on version 0.""" def runTest(self): # FIXME: make the weave, don't poke at it.
k = Weave() k._names = [b"0", b"1", b"2"] k._name_map = {b"0": 0, b"1": 1, b"2": 2} k._parents = [ frozenset(), frozenset([0]), frozenset([0]), ] k._weave = [ (b"{", 0), b"first line", (b"}", 0), (b"{", 1), b"second line", (b"}", 1), (b"{", 2), b"alternative second line", (b"}", 2), ] k._sha1s = [ sha_string(b"first line"), sha_string(b"first linesecond line"), sha_string(b"first linealternative second line"), ] self.assertEqual(k.get_lines(0), [b"first line"]) self.assertEqual(k.get_lines(1), [b"first line", b"second line"]) self.assertEqual(k.get_lines(b"2"), [b"first line", b"alternative second line"]) self.assertEqual(set(k.get_ancestry([b"2"])), {b"0", b"2"}) class ReplaceLine(TestBase): def runTest(self): k = Weave() text0 = [b"cheddar", b"stilton", b"gruyere"] text1 = [b"cheddar", b"blue vein", b"neufchatel", b"chevre"] k.add_lines(b"text0", [], text0) k.add_lines(b"text1", [b"text0"], text1) self.log("k._weave=" + pformat(k._weave)) self.assertEqual(k.get_lines(0), text0) self.assertEqual(k.get_lines(1), text1) class Merge(TestBase): """Storage of versions that merge diverged parents.""" def runTest(self): k = Weave() texts = [ [b"header"], [b"header", b"", b"line from 1"], [b"header", b"", b"line from 2", b"more from 2"], [b"header", b"", b"line from 1", b"fixup line", b"line from 2"], ] k.add_lines(b"text0", [], texts[0]) k.add_lines(b"text1", [b"text0"], texts[1]) k.add_lines(b"text2", [b"text0"], texts[2]) k.add_lines(b"merge", [b"text0", b"text1", b"text2"], texts[3]) for i, t in enumerate(texts): self.assertEqual(k.get_lines(i), t) self.assertEqual( k.annotate(b"merge"), [ (b"text0", b"header"), (b"text1", b""), (b"text1", b"line from 1"), (b"merge", b"fixup line"), (b"text2", b"line from 2"), ], ) self.assertEqual( set(k.get_ancestry([b"merge"])), {b"text0", b"text1", b"text2", b"merge"} ) self.log("k._weave=" + pformat(k._weave)) self.check_read_write(k) class Conflicts(TestBase): """Test detection of conflicting regions during a merge. 
A base version is inserted, then two descendants try to insert different lines in the same place. These should be reported as a possible conflict and forwarded to the user. """ def runTest(self): return # NOT RUN k = Weave() k.add_lines([], [b"aaa", b"bbb"]) k.add_lines([0], [b"aaa", b"111", b"bbb"]) k.add_lines([1], [b"aaa", b"222", b"bbb"]) k.merge([1, 2]) self.assertEqual([[[b"aaa"]], [[b"111"], [b"222"]], [[b"bbb"]]]) class NonConflict(TestBase): """Two descendants insert compatible changes. No conflict should be reported. """ def runTest(self): return # NOT RUN k = Weave() k.add_lines([], [b"aaa", b"bbb"]) k.add_lines([0], [b"111", b"aaa", b"ccc", b"bbb"]) k.add_lines([1], [b"aaa", b"ccc", b"bbb", b"222"]) class Khayyam(TestBase): """Test changes to multi-line texts, and read/write.""" def test_multi_line_merge(self): rawtexts = [ b"""A Book of Verses underneath the Bough, A Jug of Wine, a Loaf of Bread, -- and Thou Beside me singing in the Wilderness -- Oh, Wilderness were Paradise enow!""", b"""A Book of Verses underneath the Bough, A Jug of Wine, a Loaf of Bread, -- and Thou Beside me singing in the Wilderness -- Oh, Wilderness were Paradise now!""", b"""A Book of poems underneath the tree, A Jug of Wine, a Loaf of Bread, and Thou Beside me singing in the Wilderness -- Oh, Wilderness were Paradise now! -- O.
Khayyam""", b"""A Book of Verses underneath the Bough, A Jug of Wine, a Loaf of Bread, and Thou Beside me singing in the Wilderness -- Oh, Wilderness were Paradise now!""", ] texts = [[l.strip() for l in t.split(b"\n")] for t in rawtexts] k = Weave() parents = set() for i, t in enumerate(texts): k.add_lines(b"text%d" % i, list(parents), t) parents.add(b"text%d" % i) self.log("k._weave=" + pformat(k._weave)) for i, t in enumerate(texts): self.assertEqual(k.get_lines(i), t) self.check_read_write(k) class JoinWeavesTests(TestBase): def setUp(self): super().setUp() self.weave1 = Weave() self.lines1 = [b"hello\n"] self.lines3 = [b"hello\n", b"cruel\n", b"world\n"] self.weave1.add_lines(b"v1", [], self.lines1) self.weave1.add_lines(b"v2", [b"v1"], [b"hello\n", b"world\n"]) self.weave1.add_lines(b"v3", [b"v2"], self.lines3) def test_written_detection(self): # Test detection of weave file corruption. # # Make sure that we can detect if a weave file has # been corrupted. This doesn't test all forms of corruption, # but it at least helps verify the data you get, is what you want. w = Weave() w.add_lines(b"v1", [], [b"hello\n"]) w.add_lines(b"v2", [b"v1"], [b"hello\n", b"there\n"]) tmpf = BytesIO() write_weave(w, tmpf) # Because we are corrupting, we need to make sure we have the exact # text self.assertEqual( b"# bzr weave file v5\n" b"i\n1 f572d396fae9206628714fb2ce00f72e94f2258f\nn v1\n\n" b"i 0\n1 90f265c6e75f1c8f9ab76dcf85528352c5f215ef\nn v2\n\n" b"w\n{ 0\n. hello\n}\n{ 1\n. there\n}\nW\n", tmpf.getvalue(), ) # Change a single letter tmpf = BytesIO( b"# bzr weave file v5\n" b"i\n1 f572d396fae9206628714fb2ce00f72e94f2258f\nn v1\n\n" b"i 0\n1 90f265c6e75f1c8f9ab76dcf85528352c5f215ef\nn v2\n\n" b"w\n{ 0\n. hello\n}\n{ 1\n. 
There\n}\nW\n" ) w = read_weave(tmpf) self.assertEqual(b"hello\n", w.get_text(b"v1")) self.assertRaises(WeaveInvalidChecksum, w.get_text, b"v2") self.assertRaises(WeaveInvalidChecksum, w.get_lines, b"v2") self.assertRaises(WeaveInvalidChecksum, w.check) # Change the sha checksum tmpf = BytesIO( b"# bzr weave file v5\n" b"i\n1 f572d396fae9206628714fb2ce00f72e94f2258f\nn v1\n\n" b"i 0\n1 f0f265c6e75f1c8f9ab76dcf85528352c5f215ef\nn v2\n\n" b"w\n{ 0\n. hello\n}\n{ 1\n. there\n}\nW\n" ) w = read_weave(tmpf) self.assertEqual(b"hello\n", w.get_text(b"v1")) self.assertRaises(WeaveInvalidChecksum, w.get_text, b"v2") self.assertRaises(WeaveInvalidChecksum, w.get_lines, b"v2") self.assertRaises(WeaveInvalidChecksum, w.check) class TestWeave(TestCase): def test_allow_reserved_false(self): w = Weave("name", allow_reserved=False) # Add lines is checked at the WeaveFile level, not at the Weave level w.add_lines(b"name:", [], TEXT_1) # But get_lines is checked at this level self.assertRaises(ReservedId, w.get_lines, b"name:") def test_allow_reserved_true(self): w = Weave("name", allow_reserved=True) w.add_lines(b"name:", [], TEXT_1) self.assertEqual(TEXT_1, w.get_lines(b"name:")) class InstrumentedWeave(Weave): """Keep track of how many times functions are called.""" def __init__(self, weave_name=None): self._extract_count = 0 Weave.__init__(self, weave_name=weave_name) def _extract(self, versions): self._extract_count += 1 return Weave._extract(self, versions) class TestNeedsReweave(TestCase): """Internal corner cases for when reweave is needed.""" def test_compatible_parents(self): w1 = Weave("a") my_parents = {1, 2, 3} # subsets are ok self.assertTrue(w1._compatible_parents(my_parents, {3})) # same sets self.assertTrue(w1._compatible_parents(my_parents, set(my_parents))) # same empty corner case self.assertTrue(w1._compatible_parents(set(), set())) # other cannot contain stuff my_parents does not self.assertFalse(w1._compatible_parents(set(), {1})) 
self.assertFalse(w1._compatible_parents(my_parents, {1, 2, 3, 4})) self.assertFalse(w1._compatible_parents(my_parents, {4})) class TestWeaveFile(TestCaseInTempDir): def test_empty_file(self): with open("empty.weave", "wb+") as f: self.assertRaises(WeaveFormatError, read_weave, f) bzrformats_3.4.0.orig/bzrformats/tests/test_xml.py0000644000000000000000000005644415162115103017432 0ustar00# Copyright (C) 2005-2011 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA from io import BytesIO import bzrformats.xml5 from bzrformats import inventory, serializer, xml6, xml7, xml8 from bzrformats.inventory import Inventory from .. import osutils from ..revision import Revision from . import TestCase _revision_v5 = b""" - start splitting code for xml (de)serialization away from objects preparatory to supporting multiple formats by a single library """ _revision_v5_utc = b"""\ - start splitting code for xml (de)serialization away from objects preparatory to supporting multiple formats by a single library """ _committed_inv_v5 = b""" """ _basis_inv_v5 = b""" """ # DO NOT REFLOW THIS. It's the exact revision we want. _expected_rev_v5 = b""" - start splitting code for xml (de)serialization away from objects preparatory to supporting multiple formats by a single library """ # DO NOT REFLOW THIS. It's the exact inventory we want.
_expected_inv_v5 = b""" """ _expected_inv_v5_root = b""" """ _expected_inv_v6 = b""" """ _expected_inv_v7 = b""" """ _expected_rev_v8 = b""" - start splitting code for xml (de)serialization away from objects preparatory to supporting multiple formats by a single library """ _expected_inv_v8 = b""" """ _revision_utf8_v5 = b""" Include µnicode characters """ _expected_rev_v8_complex = b""" Include µnicode characters this has a newline in it """ _inventory_utf8_v5 = b""" """ # Before revision_id was always stored as an attribute _inventory_v5a = b""" """ # Before revision_id was always stored as an attribute _inventory_v5b = b""" """ class TestSerializer(TestCase): """Test XML serialization.""" def test_unpack_revision_5(self): """Test unpacking a canned revision v5.""" inp = BytesIO(_revision_v5) rev = bzrformats.xml5.revision_serializer_v5.read_revision(inp) eq = self.assertEqual eq(rev.committer, "Martin Pool ") eq(len(rev.parent_ids), 1) eq(rev.timezone, 36000) eq(rev.parent_ids[0], b"mbp@sourcefrog.net-20050905063503-43948f59fa127d92") def test_unpack_revision_5_utc(self): inp = BytesIO(_revision_v5_utc) rev = bzrformats.xml5.revision_serializer_v5.read_revision(inp) eq = self.assertEqual eq(rev.committer, "Martin Pool ") eq(len(rev.parent_ids), 1) eq(rev.timezone, 0) eq(rev.parent_ids[0], b"mbp@sourcefrog.net-20050905063503-43948f59fa127d92") def test_unpack_inventory_5(self): """Unpack canned new-style inventory.""" inp = BytesIO(_committed_inv_v5) inv = bzrformats.xml5.inventory_serializer_v5.read_inventory(inp) eq = self.assertEqual eq(len(inv), 4) ie = inv.get_entry(b"bar-20050824000535-6bc48cfad47ed134") eq(ie.kind, "file") eq(ie.revision, b"mbp@foo-00") eq(ie.name, "bar") eq(inv.get_entry(ie.parent_id).kind, "directory") def test_unpack_basis_inventory_5(self): """Unpack canned new-style inventory.""" inv = bzrformats.xml5.inventory_serializer_v5.read_inventory_from_lines( osutils.split_lines(_basis_inv_v5) ) eq = self.assertEqual eq(len(inv), 4) 
eq(inv.revision_id, b"mbp@sourcefrog.net-20050905063503-43948f59fa127d92") ie = inv.get_entry(b"bar-20050824000535-6bc48cfad47ed134") eq(ie.kind, "file") eq(ie.revision, b"mbp@foo-00") eq(ie.name, "bar") eq(inv.get_entry(ie.parent_id).kind, "directory") def test_unpack_inventory_5a(self): inv = bzrformats.xml5.inventory_serializer_v5.read_inventory_from_lines( osutils.split_lines(_inventory_v5a), revision_id=b"test-rev-id" ) self.assertEqual(b"test-rev-id", inv.root.revision) def test_unpack_inventory_5b(self): inv = bzrformats.xml5.inventory_serializer_v5.read_inventory_from_lines( osutils.split_lines(_inventory_v5b), revision_id=b"test-rev-id" ) self.assertEqual(b"a-rev-id", inv.root.revision) def test_repack_inventory_5(self): inv = bzrformats.xml5.inventory_serializer_v5.read_inventory_from_lines( osutils.split_lines(_committed_inv_v5) ) outp = BytesIO() bzrformats.xml5.inventory_serializer_v5.write_inventory(inv, outp) self.assertEqualDiff(_expected_inv_v5, outp.getvalue()) inv2 = bzrformats.xml5.inventory_serializer_v5.read_inventory_from_lines( osutils.split_lines(outp.getvalue()) ) self.assertEqual(inv, inv2) def assertRoundTrips(self, xml_string): inp = BytesIO(xml_string) inv = bzrformats.xml5.inventory_serializer_v5.read_inventory(inp) outp = BytesIO() bzrformats.xml5.inventory_serializer_v5.write_inventory(inv, outp) self.assertEqualDiff(xml_string, outp.getvalue()) lines = bzrformats.xml5.inventory_serializer_v5.write_inventory_to_lines(inv) outp.seek(0) self.assertEqual(outp.readlines(), lines) inv2 = bzrformats.xml5.inventory_serializer_v5.read_inventory( BytesIO(outp.getvalue()) ) self.assertEqual(inv, inv2) def tests_serialize_inventory_v5_with_root(self): self.assertRoundTrips(_expected_inv_v5_root) def check_repack_revision(self, txt): """Check that repacking a revision yields the same information.""" inp = BytesIO(txt) rev = bzrformats.xml5.revision_serializer_v5.read_revision(inp) outfile_contents = ( 
bzrformats.xml5.revision_serializer_v5.write_revision_to_string(rev) ) rev2 = bzrformats.xml5.revision_serializer_v5.read_revision( BytesIO(outfile_contents) ) self.assertEqual(rev, rev2) def test_repack_revision_5(self): """Round-trip revision to XML v5.""" self.check_repack_revision(_revision_v5) def test_repack_revision_5_utc(self): self.check_repack_revision(_revision_v5_utc) def test_pack_revision_5(self): """Pack revision to XML v5.""" # fixed 20051025, revisions should have final newline rev = bzrformats.xml5.revision_serializer_v5.read_revision_from_string( _revision_v5 ) outfile_contents = ( bzrformats.xml5.revision_serializer_v5.write_revision_to_string(rev) ) self.assertEqual(outfile_contents[-1:], b"\n") self.assertEqualDiff( outfile_contents, b"".join( bzrformats.xml5.revision_serializer_v5.write_revision_to_lines(rev) ), ) self.assertEqualDiff(outfile_contents, _expected_rev_v5) def test_empty_property_value(self): """Create an empty property value and check that it serializes correctly.""" s_v5 = bzrformats.xml5.revision_serializer_v5 rev = s_v5.read_revision_from_string(_revision_v5) props = {"empty": "", "one": "one"} rev = Revision( revision_id=rev.revision_id, timestamp=rev.timestamp, timezone=rev.timezone, committer=rev.committer, message=rev.message, parent_ids=rev.parent_ids, inventory_sha1=rev.inventory_sha1, properties=props, ) txt = b"".join(s_v5.write_revision_to_lines(rev)) new_rev = s_v5.read_revision_from_string(txt) self.assertEqual(props, new_rev.properties) def get_sample_inventory(self): inv = Inventory(root_id=None, revision_id=b"rev_outer") inv.add(inventory.InventoryDirectory(b"tree-root-321", "", None, b"rev_outer")) inv.add( inventory.InventoryFile( b"file-id", "file", b"tree-root-321", b"rev_outer", text_sha1=b"A", text_size=1, ) ) inv.add( inventory.InventoryDirectory( b"dir-id", "dir", b"tree-root-321", b"rev_outer" ) ) inv.add( inventory.InventoryLink( b"link-id", "link", b"tree-root-321", b"rev_outer", symlink_target="a" ) )
return inv def test_roundtrip_inventory_v7(self): inv = self.get_sample_inventory() inv.add( inventory.TreeReference( b"nested-id", "nested", b"tree-root-321", b"rev_outer", b"rev_inner" ) ) lines = xml7.inventory_serializer_v7.write_inventory_to_lines(inv) self.assertEqualDiff(_expected_inv_v7, b"".join(lines)) inv2 = xml7.inventory_serializer_v7.read_inventory_from_lines(lines) self.assertEqual(5, len(inv2)) for _path, ie in inv.iter_entries(): self.assertEqual(ie, inv2.get_entry(ie.file_id)) def test_roundtrip_inventory_v6(self): inv = self.get_sample_inventory() lines = xml6.inventory_serializer_v6.write_inventory_to_lines(inv) self.assertEqualDiff(_expected_inv_v6, b"".join(lines)) inv2 = xml6.inventory_serializer_v6.read_inventory_from_lines(lines) self.assertEqual(4, len(inv2)) for _path, ie in inv.iter_entries(): self.assertEqual(ie, inv2.get_entry(ie.file_id)) def test_wrong_format_v7(self): """Can't accidentally open a file with wrong serializer.""" s_v6 = bzrformats.xml6.inventory_serializer_v6 s_v7 = xml7.inventory_serializer_v7 self.assertRaises( serializer.UnexpectedInventoryFormat, s_v7.read_inventory_from_lines, osutils.split_lines(_expected_inv_v5), ) self.assertRaises( serializer.UnexpectedInventoryFormat, s_v6.read_inventory_from_lines, osutils.split_lines(_expected_inv_v7), ) def test_tree_reference(self): s_v5 = bzrformats.xml5.inventory_serializer_v5 s_v6 = bzrformats.xml6.inventory_serializer_v6 s_v7 = xml7.inventory_serializer_v7 inv = Inventory( b"tree-root-321", revision_id=b"rev-outer", root_revision=b"root-rev" ) inv.add( inventory.TreeReference( b"nested-id", "nested", b"tree-root-321", b"rev-outer", b"rev-inner" ) ) self.assertRaises( serializer.UnsupportedInventoryKind, s_v5.write_inventory_to_lines, inv ) self.assertRaises( serializer.UnsupportedInventoryKind, s_v6.write_inventory_to_lines, inv ) lines = s_v7.write_inventory_to_chunks(inv) inv2 = s_v7.read_inventory_from_lines(lines) self.assertEqual(b"tree-root-321", 
inv2.get_entry(b"nested-id").parent_id) self.assertEqual(b"rev-outer", inv2.get_entry(b"nested-id").revision) self.assertEqual(b"rev-inner", inv2.get_entry(b"nested-id").reference_revision) def test_roundtrip_inventory_v8(self): inv = self.get_sample_inventory() lines = xml8.inventory_serializer_v8.write_inventory_to_lines(inv) inv2 = xml8.inventory_serializer_v8.read_inventory_from_lines(lines) self.assertEqual(4, len(inv2)) for _path, ie in inv.iter_entries(): self.assertEqual(ie, inv2.get_entry(ie.file_id)) def test_inventory_text_v8(self): inv = self.get_sample_inventory() lines = xml8.inventory_serializer_v8.write_inventory_to_lines(inv) self.assertEqualDiff(_expected_inv_v8, b"".join(lines)) def test_revision_text_v5(self): """Pack revision to XML v5.""" rev = bzrformats.xml5.revision_serializer_v5.read_revision_from_string( _expected_rev_v5 ) serialized = bzrformats.xml5.revision_serializer_v5.write_revision_to_lines(rev) self.assertEqualDiff(b"".join(serialized), _expected_rev_v5) def test_revision_text_v8(self): """Pack revision to XML v8.""" rev = bzrformats.xml8.revision_serializer_v8.read_revision_from_string( _expected_rev_v8 ) serialized = bzrformats.xml8.revision_serializer_v8.write_revision_to_lines(rev) self.assertEqualDiff(b"".join(serialized), _expected_rev_v8) def test_revision_text_v8_complex(self): """Pack revision to XML v8.""" rev = bzrformats.xml8.revision_serializer_v8.read_revision_from_string( _expected_rev_v8_complex ) serialized = bzrformats.xml8.revision_serializer_v8.write_revision_to_lines(rev) self.assertEqualDiff(b"".join(serialized), _expected_rev_v8_complex) def test_revision_ids_are_utf8(self): """Parsed revision_ids should all be utf-8 strings, not unicode.""" sr_v5 = bzrformats.xml5.revision_serializer_v5 si_v5 = bzrformats.xml5.inventory_serializer_v5 rev = sr_v5.read_revision_from_string(_revision_utf8_v5) self.assertEqual(b"erik@b\xc3\xa5gfors-02", rev.revision_id) self.assertIsInstance(rev.revision_id, bytes)
self.assertEqual([b"erik@b\xc3\xa5gfors-01"], rev.parent_ids) for parent_id in rev.parent_ids: self.assertIsInstance(parent_id, bytes) self.assertEqual("Include \xb5nicode characters\n", rev.message) self.assertIsInstance(rev.message, str) # ie.revision should either be None or a utf-8 revision id inv = si_v5.read_inventory_from_lines(osutils.split_lines(_inventory_utf8_v5)) rev_id_1 = "erik@b\xe5gfors-01".encode() rev_id_2 = "erik@b\xe5gfors-02".encode() fid_root = "TRE\xe9_ROOT".encode() fid_bar1 = "b\xe5r-01".encode() fid_sub = "s\xb5bdir-01".encode() fid_bar2 = "b\xe5r-02".encode() expected = [ ("", fid_root, None, rev_id_2), ("b\xe5r", fid_bar1, fid_root, rev_id_1), ("s\xb5bdir", fid_sub, fid_root, rev_id_1), ("s\xb5bdir/b\xe5r", fid_bar2, fid_sub, rev_id_2), ] self.assertEqual(rev_id_2, inv.revision_id) self.assertIsInstance(inv.revision_id, bytes) actual = list(inv.iter_entries_by_dir()) for (exp_path, exp_file_id, exp_parent_id, exp_rev_id), ( act_path, act_ie, ) in zip(expected, actual, strict=False): self.assertEqual(exp_path, act_path) self.assertIsInstance(act_path, str) self.assertEqual(exp_file_id, act_ie.file_id) self.assertIsInstance(act_ie.file_id, bytes) self.assertEqual(exp_parent_id, act_ie.parent_id) if exp_parent_id is not None: self.assertIsInstance(act_ie.parent_id, bytes) self.assertEqual(exp_rev_id, act_ie.revision) if exp_rev_id is not None: self.assertIsInstance(act_ie.revision, bytes) self.assertEqual(len(expected), len(actual)) def test_serialization_error(self): s_v5 = bzrformats.xml5.inventory_serializer_v5 e = self.assertRaises( serializer.UnexpectedInventoryFormat, s_v5.read_inventory_from_lines, [b""), ) def test_utf8_with_xml(self): # u'\xb5\xe5&\u062c' utf8_str = b"\xc2\xb5\xc3\xa5&\xd8\xac" self.assertEqual( b"µå&ج", bzrformats.xml_serializer.encode_and_escape(utf8_str), ) def test_unicode(self): uni_str = "\xb5\xe5&\u062c" self.assertEqual( b"µå&ج", bzrformats.xml_serializer.encode_and_escape(uni_str), ) class 
TestMisc(TestCase): def test_unescape_xml(self): """We get some kind of error when malformed entities are passed.""" self.assertRaises(KeyError, bzrformats.xml8._unescape_xml, b"foo&bar;") bzrformats_3.4.0.orig/bzrformats/tests/per_inventory/__init__.py0000644000000000000000000000450315162115103022222 0ustar00# Copyright (C) 2005, 2006, 2007 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for different inventory implementations.""" from testscenarios import load_tests_apply_scenarios from bzrformats import groupcompress from bzrformats.inventory import CHKInventory, Inventory from .. 
import TestCaseWithMemoryTransport def _inv_to_chk_inv(test, inv): """CHKInventory needs a backing VF, so we create one.""" factory = groupcompress.make_pack_factory(True, True, 1) trans = test.get_transport("chk-inv") trans.ensure_base() vf = factory(trans) chk_inv = CHKInventory.from_inventory( vf, inv, maximum_size=100, search_key_name=b"hash-255-way" ) return chk_inv def load_tests(loader, basic_tests, pattern): suite = loader.loadTestsFromName("bzrformats.tests.per_inventory.basics") return load_tests_apply_scenarios(loader, suite, pattern) class TestCaseWithInventory(TestCaseWithMemoryTransport): scenarios = [ ( "Inventory", {"_inventory_class": Inventory, "_inv_to_test_inv": lambda test, inv: inv}, ), ( "CHKInventory", { "_inventory_class": CHKInventory, "_inv_to_test_inv": _inv_to_chk_inv, }, ), ] _inventory_class = None # set by scenarios _inv_to_test_inv = None # set by scenarios def make_test_inventory(self): """Return an instance of the Inventory class under test.""" return self._inventory_class() def inv_to_test_inv(self, inv): """Convert a regular Inventory object into an inventory under test.""" return self._inv_to_test_inv(self, inv) bzrformats_3.4.0.orig/bzrformats/tests/per_inventory/basics.py0000644000000000000000000004174515162115103021740 0ustar00# Copyright (C) 2005, 2006, 2007 Canonical Ltd # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA """Tests for different inventory implementations.""" # NOTE: Don't import Inventory here, to make sure that we don't accidentally # hardcode that when we should be using self.make_inventory from bzrformats import inventory, osutils from bzrformats.errors import InconsistentDelta from bzrformats.inventory import NoSuchId from bzrformats.tests.per_inventory import TestCaseWithInventory from ...inventory import InventoryFile, InventoryLink from ...inventory_delta import InventoryDelta class TestInventory(TestCaseWithInventory): def make_init_inventory(self): inv = inventory.Inventory(root_id=None, revision_id=b"initial-rev") root = inventory.InventoryDirectory(b"tree-root", "", None, b"initial-rev") inv.add(root) return self.inv_to_test_inv(inv) def make_file( self, file_id, name, parent_id, content=b"content\n", revision=b"new-test-rev" ): return InventoryFile( file_id, name, parent_id, text_sha1=osutils.sha_string(content), text_size=len(content), revision=revision, ) def make_link(self, file_id, name, parent_id, target="link-target\n"): return InventoryLink(file_id, name, parent_id, symlink_target=target) def prepare_inv_with_nested_dirs(self): inv = inventory.Inventory(root_id=None) root = inventory.InventoryDirectory(b"tree-root", "", None, b"revision") inv.add(root) for args in [ ("src", "directory", b"src-id"), ("doc", "directory", b"doc-id"), ("src/hello.c", "file", b"hello-id"), ("src/bye.c", "file", b"bye-id"), ("zz", "file", b"zz-id"), ("src/sub/", "directory", b"sub-id"), ("src/zz.c", "file", b"zzc-id"), ("src/sub/a", "file", b"a-id"), ("Makefile", "file", b"makefile-id"), ]: kwargs = {} if args[1] == "file": kwargs["text_sha1"] = osutils.sha_string(b"content\n") kwargs["text_size"] = len(b"content\n") inv.add_path(*args, revision=b"revision", 
**kwargs) return self.inv_to_test_inv(inv) class TestInventoryCreateByApplyDelta(TestInventory): """A subset of the inventory delta application tests. See test_inv which has comprehensive delta application tests for inventories, dirstate, and repository based inventories. """ def test_add(self): inv = self.make_init_inventory() inv = inv.create_by_apply_delta( InventoryDelta( [ (None, "a", b"a-id", self.make_file(b"a-id", "a", b"tree-root")), ] ), b"new-test-rev", ) self.assertEqual("a", inv.id2path(b"a-id")) def test_delete(self): inv = self.make_init_inventory() inv = inv.create_by_apply_delta( InventoryDelta( [ (None, "a", b"a-id", self.make_file(b"a-id", "a", b"tree-root")), ] ), b"new-rev-1", ) self.assertEqual("a", inv.id2path(b"a-id")) inv = inv.create_by_apply_delta( InventoryDelta( [ ("a", None, b"a-id", None), ] ), b"new-rev-2", ) self.assertRaises(NoSuchId, inv.id2path, b"a-id") def test_rename(self): inv = self.make_init_inventory() inv = inv.create_by_apply_delta( InventoryDelta( [ (None, "a", b"a-id", self.make_file(b"a-id", "a", b"tree-root")), ] ), b"new-rev-1", ) self.assertEqual("a", inv.id2path(b"a-id")) a_ie = inv.get_entry(b"a-id") b_ie = self.make_file(a_ie.file_id, "b", a_ie.parent_id) inv = inv.create_by_apply_delta( InventoryDelta([("a", "b", b"a-id", b_ie)]), b"new-rev-2" ) self.assertEqual("b", inv.id2path(b"a-id")) def test_illegal(self): # A file-id cannot appear in a delta more than once inv = self.make_init_inventory() self.assertRaises( InconsistentDelta, inv.create_by_apply_delta, InventoryDelta( [ (None, "a", b"id-1", self.make_file(b"id-1", "a", b"tree-root")), (None, "b", b"id-1", self.make_file(b"id-1", "b", b"tree-root")), ] ), b"new-rev-1", ) class TestInventoryReads(TestInventory): def test_is_root(self): """Ensure our root-checking code is accurate.""" inv = self.make_init_inventory() self.assertTrue(inv.is_root(b"tree-root")) self.assertFalse(inv.is_root(b"booga")) ie = inventory.InventoryDirectory( b"booga", "", None, 
revision=inv.root.revision ) inv = inv.create_by_apply_delta( InventoryDelta([("", None, b"tree-root", None), (None, "", b"booga", ie)]), b"new-rev-2", ) self.assertFalse(inv.is_root(b"TREE_ROOT")) self.assertTrue(inv.is_root(b"booga")) def test_ids(self): """Test detection of files within selected directories.""" inv = inventory.Inventory(root_id=None) root = inventory.InventoryDirectory(b"tree-root", "", None, b"revision") inv.add(root) for args in [ ("src", "directory", b"src-id"), ("doc", "directory", b"doc-id"), ("src/hello.c", "file"), ("src/bye.c", "file", b"bye-id"), ("Makefile", "file"), ]: kwargs = {} if args[1] == "file": kwargs["text_sha1"] = osutils.sha_string(b"content\n") kwargs["text_size"] = len(b"content\n") inv.add_path(*args, revision=b"revision", **kwargs) inv = self.inv_to_test_inv(inv) self.assertEqual(inv.path2id("src"), b"src-id") self.assertEqual(inv.path2id("src/bye.c"), b"bye-id") def test_get_entry_by_path_partial(self): inv = inventory.Inventory(root_id=None) root = inventory.InventoryDirectory(b"TREE_ROOT", "", None, b"revision") inv.add(root) for args in [ ("src", "directory", b"src-id"), ("doc", "directory", b"doc-id"), ("src/hello.c", "file"), ("src/bye.c", "file", b"bye-id"), ("Makefile", "file"), ("external", "tree-reference", b"other-root"), ]: kwargs = {} if args[1] == "file": kwargs["text_sha1"] = osutils.sha_string(b"content\n") kwargs["text_size"] = len(b"content\n") if args[1] == "tree-reference": kwargs["reference_revision"] = b"reference" ie = inv.add_path(*args, revision=b"revision", **kwargs) inv = self.inv_to_test_inv(inv) # Standard lookups ie, resolved, remaining = inv.get_entry_by_path_partial("") self.assertEqual((ie.file_id, resolved, remaining), (b"TREE_ROOT", [], [])) ie, resolved, remaining = inv.get_entry_by_path_partial("src") self.assertEqual((ie.file_id, resolved, remaining), (b"src-id", ["src"], [])) ie, resolved, remaining = inv.get_entry_by_path_partial("src/bye.c") self.assertEqual( (ie.file_id, 
resolved, remaining), (b"bye-id", ["src", "bye.c"], []) ) # Paths in the external tree ie, resolved, remaining = inv.get_entry_by_path_partial("external") self.assertEqual( (ie.file_id, resolved, remaining), (b"other-root", ["external"], []) ) ie, resolved, remaining = inv.get_entry_by_path_partial("external/blah") self.assertEqual( (ie.file_id, resolved, remaining), (b"other-root", ["external"], ["blah"]) ) # Nonexistent paths ie, resolved, remaining = inv.get_entry_by_path_partial("foo.c") self.assertEqual((ie, resolved, remaining), (None, None, None)) def test_non_directory_children(self): """Test path2id when a parent directory has no children.""" inv = inventory.Inventory(b"tree-root") inv.add(self.make_file(b"file-id", "file", b"tree-root")) inv.add(self.make_link(b"link-id", "link", b"tree-root")) self.assertIs(None, inv.path2id("file/subfile")) self.assertIs(None, inv.path2id("link/subfile")) def test_is_unmodified(self): f1 = self.make_file(b"file-id", "file", b"tree-root", revision=b"rev") self.assertTrue(f1.is_unmodified(f1)) f2 = self.make_file(b"file-id", "file", b"tree-root", revision=b"rev") self.assertTrue(f1.is_unmodified(f2)) f3 = self.make_file(b"file-id", "file", b"tree-root") self.assertFalse(f1.is_unmodified(f3)) f4 = self.make_file(b"file-id", "file", b"tree-root", revision=b"rev1") self.assertFalse(f1.is_unmodified(f4)) def test_iter_entries(self): inv = self.prepare_inv_with_nested_dirs() # Test all entries self.assertEqual( [ ("", b"tree-root"), ("Makefile", b"makefile-id"), ("doc", b"doc-id"), ("src", b"src-id"), ("src/bye.c", b"bye-id"), ("src/hello.c", b"hello-id"), ("src/sub", b"sub-id"), ("src/sub/a", b"a-id"), ("src/zz.c", b"zzc-id"), ("zz", b"zz-id"), ], [(path, ie.file_id) for path, ie in
inv.iter_entries(from_dir=b"src-id")], ) # Test not recursing at the root level self.assertEqual( [ ("", b"tree-root"), ("Makefile", b"makefile-id"), ("doc", b"doc-id"), ("src", b"src-id"), ("zz", b"zz-id"), ], [(path, ie.file_id) for path, ie in inv.iter_entries(recursive=False)], ) # Test not recursing at a subdirectory level self.assertEqual( [ ("bye.c", b"bye-id"), ("hello.c", b"hello-id"), ("sub", b"sub-id"), ("zz.c", b"zzc-id"), ], [ (path, ie.file_id) for path, ie in inv.iter_entries(from_dir=b"src-id", recursive=False) ], ) def test_iter_entries_by_dir(self): inv = self.prepare_inv_with_nested_dirs() self.assertEqual( [ ("", b"tree-root"), ("Makefile", b"makefile-id"), ("doc", b"doc-id"), ("src", b"src-id"), ("zz", b"zz-id"), ("src/bye.c", b"bye-id"), ("src/hello.c", b"hello-id"), ("src/sub", b"sub-id"), ("src/zz.c", b"zzc-id"), ("src/sub/a", b"a-id"), ], [(path, ie.file_id) for path, ie in inv.iter_entries_by_dir()], ) self.assertEqual( [ ("", b"tree-root"), ("Makefile", b"makefile-id"), ("doc", b"doc-id"), ("src", b"src-id"), ("zz", b"zz-id"), ("src/bye.c", b"bye-id"), ("src/hello.c", b"hello-id"), ("src/sub", b"sub-id"), ("src/zz.c", b"zzc-id"), ("src/sub/a", b"a-id"), ], [ (path, ie.file_id) for path, ie in inv.iter_entries_by_dir( specific_file_ids={ b"a-id", b"zzc-id", b"doc-id", b"tree-root", b"hello-id", b"bye-id", b"zz-id", b"src-id", b"makefile-id", b"sub-id", } ) ], ) self.assertEqual( [ ("Makefile", b"makefile-id"), ("doc", b"doc-id"), ("zz", b"zz-id"), ("src/bye.c", b"bye-id"), ("src/hello.c", b"hello-id"), ("src/zz.c", b"zzc-id"), ("src/sub/a", b"a-id"), ], [ (path, ie.file_id) for path, ie in inv.iter_entries_by_dir( specific_file_ids={ b"a-id", b"zzc-id", b"doc-id", b"hello-id", b"bye-id", b"zz-id", b"makefile-id", } ) ], ) self.assertEqual( [ ("Makefile", b"makefile-id"), ("src/bye.c", b"bye-id"), ], [ (path, ie.file_id) for path, ie in inv.iter_entries_by_dir( specific_file_ids={b"bye-id", b"makefile-id"} ) ], ) self.assertEqual( [ 
                ("Makefile", b"makefile-id"),
                ("src/bye.c", b"bye-id"),
            ],
            [
                (path, ie.file_id)
                for path, ie in inv.iter_entries_by_dir(
                    specific_file_ids={b"bye-id", b"makefile-id"}
                )
            ],
        )

        self.assertEqual(
            [
                ("src/bye.c", b"bye-id"),
            ],
            [
                (path, ie.file_id)
                for path, ie in inv.iter_entries_by_dir(specific_file_ids={b"bye-id"})
            ],
        )


class TestInventoryFiltering(TestInventory):
    def test_inv_filter_empty(self):
        inv = self.prepare_inv_with_nested_dirs()
        new_inv = inv.filter(set())
        self.assertEqual(
            [
                ("", b"tree-root"),
            ],
            [(path, ie.file_id) for path, ie in new_inv.iter_entries()],
        )

    def test_inv_filter_files(self):
        inv = self.prepare_inv_with_nested_dirs()
        new_inv = inv.filter({b"zz-id", b"hello-id", b"a-id"})
        self.assertEqual(
            [
                ("", b"tree-root"),
                ("src", b"src-id"),
                ("src/hello.c", b"hello-id"),
                ("src/sub", b"sub-id"),
                ("src/sub/a", b"a-id"),
                ("zz", b"zz-id"),
            ],
            [(path, ie.file_id) for path, ie in new_inv.iter_entries()],
        )

    def test_inv_filter_dirs(self):
        inv = self.prepare_inv_with_nested_dirs()
        new_inv = inv.filter({b"doc-id", b"sub-id"})
        self.assertEqual(
            [
                ("", b"tree-root"),
                ("doc", b"doc-id"),
                ("src", b"src-id"),
                ("src/sub", b"sub-id"),
                ("src/sub/a", b"a-id"),
            ],
            [(path, ie.file_id) for path, ie in new_inv.iter_entries()],
        )

    def test_inv_filter_files_and_dirs(self):
        inv = self.prepare_inv_with_nested_dirs()
        new_inv = inv.filter({b"makefile-id", b"src-id"})
        self.assertEqual(
            [
                ("", b"tree-root"),
                ("Makefile", b"makefile-id"),
                ("src", b"src-id"),
                ("src/bye.c", b"bye-id"),
                ("src/hello.c", b"hello-id"),
                ("src/sub", b"sub-id"),
                ("src/sub/a", b"a-id"),
                ("src/zz.c", b"zzc-id"),
            ],
            [(path, ie.file_id) for path, ie in new_inv.iter_entries()],
        )

    def test_inv_filter_entry_not_present(self):
        inv = self.prepare_inv_with_nested_dirs()
        new_inv = inv.filter({b"not-present-id"})
        self.assertEqual(
            [
                ("", b"tree-root"),
            ],
            [(path, ie.file_id) for path, ie in new_inv.iter_entries()],
        )

bzrformats_3.4.0.orig/crates/bazaar-py/0000755000000000000000000000000015162074037015032 
5ustar00bzrformats_3.4.0.orig/crates/bazaar/0000755000000000000000000000000015162074037014404 5ustar00bzrformats_3.4.0.orig/crates/osutils-py/0000755000000000000000000000000014414043471015271 5ustar00bzrformats_3.4.0.orig/crates/osutils-rs/0000755000000000000000000000000015162074037015270 5ustar00bzrformats_3.4.0.orig/crates/bazaar-py/Cargo.toml0000644000000000000000000000056215162074037016765 0ustar00[package] name = "bazaar-py" version = { workspace = true } edition = "2018" [lib] crate-type = ["cdylib"] [dependencies] bazaar = { path = "../bazaar", features=["pyo3"] } pyo3 = { workspace = true, features = ["extension-module", "chrono"]} pyo3-filelike = { workspace = true } chrono = { workspace = true } osutils = { path = "../osutils-rs", features = ["pyo3"] } bzrformats_3.4.0.orig/crates/bazaar-py/src/0000755000000000000000000000000015162074037015621 5ustar00bzrformats_3.4.0.orig/crates/bazaar-py/src/chk_map.rs0000644000000000000000000000335615162074037017600 0ustar00use bazaar::chk_map::Key; use pyo3::prelude::*; use pyo3::types::PyBytes; use pyo3::wrap_pyfunction; #[pyfunction] fn _search_key_16(py: Python, key: Vec>) -> Bound { let key: Key = key.into(); let ret = bazaar::chk_map::search_key_16(&key); PyBytes::new(py, &ret) } #[pyfunction] fn _search_key_255(py: Python, key: Vec>) -> Bound { let key: Key = key.into(); let ret = bazaar::chk_map::search_key_255(&key); PyBytes::new(py, &ret) } #[pyfunction] fn _bytes_to_text_key(py: Python, key: Vec) -> PyResult<(Bound, Bound)> { let ret = bazaar::chk_map::bytes_to_text_key(key.as_slice()); if ret.is_err() { return Err(PyErr::new::( "Invalid key", )); } let ret = ret.unwrap(); Ok((PyBytes::new(py, ret.0), PyBytes::new(py, ret.1))) } #[pyfunction] fn common_prefix_pair<'a>(py: Python<'a>, key: &'a [u8], key2: &'a [u8]) -> Bound<'a, PyBytes> { PyBytes::new(py, bazaar::chk_map::common_prefix_pair(key, key2)) } #[pyfunction] fn common_prefix_many(py: Python, keys: Vec>) -> Option> { let keys = keys.iter().map(|v| 
v.as_slice()).collect::>(); bazaar::chk_map::common_prefix_many(keys.into_iter()) .as_ref() .map(|v| PyBytes::new(py, v)) } pub(crate) fn _chk_map_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "chk_map")?; m.add_wrapped(wrap_pyfunction!(_search_key_16))?; m.add_wrapped(wrap_pyfunction!(_search_key_255))?; m.add_wrapped(wrap_pyfunction!(_bytes_to_text_key))?; m.add_wrapped(wrap_pyfunction!(common_prefix_pair))?; m.add_wrapped(wrap_pyfunction!(common_prefix_many))?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar-py/src/dirstate.rs0000644000000000000000000003210315162212430017774 0ustar00#![allow(non_snake_case)] use bazaar::FileId; use pyo3::exceptions::PyTypeError; use pyo3::prelude::*; use pyo3::types::{PyBytes, PyDict, PyList, PyString, PyTuple}; use pyo3::wrap_pyfunction; use std::ffi::OsString; #[cfg(unix)] use std::os::unix::ffi::OsStringExt; #[cfg(unix)] use std::os::unix::fs::{MetadataExt, PermissionsExt}; use std::path::{Path, PathBuf}; // TODO(jelmer): Shared pyo3 utils? fn extract_path(object: &Bound) -> PyResult { if let Ok(path) = object.extract::>() { #[cfg(unix)] { Ok(PathBuf::from(OsString::from_vec(path))) } #[cfg(not(unix))] { Ok(PathBuf::from( String::from_utf8(path) .map_err(|e| PyTypeError::new_err(e.to_string()))?, )) } } else if let Ok(path) = object.extract::() { Ok(path) } else { Err(PyTypeError::new_err("path must be a string or bytes")) } } /// Compare two paths directory by directory. /// /// This is equivalent to doing:: /// /// operator.lt(path1.split('/'), path2.split('/')) /// /// The idea is that you should compare path components separately. This /// differs from plain ``path1 < path2`` for paths like ``'a-b'`` and ``a/b``. /// "a-b" comes after "a" but would come before "a/b" lexically. 
///
/// For example, in plain Python (an illustrative sketch, not part of the
/// implementation)::
///
///     >>> "a-b" < "a/b"                        # plain string comparison
///     True
///     >>> "a-b".split("/") < "a/b".split("/")  # compared by components
///     False
/// /// Args: /// path1: first path /// path2: second path /// Returns: True if path1 comes first, otherwise False #[pyfunction] fn lt_by_dirs(path1: &Bound, path2: &Bound) -> PyResult { let path1 = extract_path(path1)?; let path2 = extract_path(path2)?; Ok(bazaar::dirstate::lt_by_dirs(&path1, &path2)) } /// Return the index where to insert path into paths. /// /// This uses the dirblock sorting. So all children in a directory come before /// the children of children. For example:: /// /// a/ /// b/ /// c /// d/ /// e /// b-c /// d-e /// a-a /// a=c /// /// Will be sorted as:: /// /// a /// a-a /// a=c /// a/b /// a/b-c /// a/d /// a/d-e /// a/b/c /// a/d/e /// /// Args: /// paths: A list of paths to search through /// path: A single path to insert /// Returns: An offset where 'path' can be inserted. /// See also: bisect.bisect_left #[pyfunction] fn bisect_path_left(paths: Vec>, path: &Bound) -> PyResult { let path = extract_path(path)?; let paths = paths .iter() .map(|x| extract_path(x).unwrap()) .collect::>(); let offset = bazaar::dirstate::bisect_path_left( paths .iter() .map(|x| x.as_path()) .collect::>() .as_slice(), &path, ); Ok(offset) } /// Return the index where to insert path into paths. /// /// This uses a path-wise comparison so we get:: /// a /// a-b /// a=b /// a/b /// Rather than:: /// a /// a-b /// a/b /// a=b /// /// Args: /// paths: A list of paths to search through /// path: A single path to insert /// Returns: An offset where 'path' can be inserted. 
/// See also: bisect.bisect_right #[pyfunction] fn bisect_path_right(paths: Vec>, path: &Bound) -> PyResult { let path = extract_path(path)?; let paths = paths .iter() .map(|x| extract_path(x).unwrap()) .collect::>(); let offset = bazaar::dirstate::bisect_path_right( paths .iter() .map(|x| x.as_path()) .collect::>() .as_slice(), &path, ); Ok(offset) } #[pyfunction] fn lt_path_by_dirblock(path1: &Bound, path2: &Bound) -> PyResult { let path1 = extract_path(path1)?; let path2 = extract_path(path2)?; Ok(bazaar::dirstate::lt_path_by_dirblock(&path1, &path2)) } #[pyfunction] #[pyo3(signature = (dirblocks, dirname, lo=None, hi=None, cache=None))] fn bisect_dirblock( py: Python, dirblocks: &Bound, dirname: &Bound, lo: Option, hi: Option, cache: Option>, ) -> PyResult { fn split_object(obj: &Bound) -> PyResult> { if let Ok(py_str) = obj.extract::>() { Ok(py_str .to_string() .split('/') .map(PathBuf::from) .collect::>()) } else if let Ok(py_bytes) = obj.extract::>() { Ok(py_bytes .as_bytes() .split(|&byte| byte == b'/') .map(|s| PathBuf::from(String::from_utf8_lossy(s).to_string())) .collect::>()) } else { Err(PyTypeError::new_err("Not a PyBytes or PyString")) } } let hi = hi.unwrap_or(dirblocks.len()); let cache = cache.unwrap_or_else(|| PyDict::new(py)); let dirname_split = match cache.get_item(dirname)? { Some(item) => item.extract::>()?, None => { let split = split_object(dirname)?; cache.set_item(dirname.clone(), split.clone())?; split } }; let mut lo = lo.unwrap_or(0); let mut hi = hi; while lo < hi { let mid = (lo + hi) / 2; let dirblock = dirblocks.get_item(mid)?.downcast_into::()?; let cur = dirblock.get_item(0)?; let cur_split = match cache.get_item(&cur)? { Some(item) => item.extract::>()?, None => { let split = split_object(&cur)?; cache.set_item(cur, split.clone())?; split } }; if cur_split < dirname_split { lo = mid + 1; } else { hi = mid; } } Ok(lo) } // TODO(jelmer): Move this into a more central place? 
#[pyclass] struct StatResult { metadata: std::fs::Metadata, } #[pymethods] impl StatResult { #[getter] fn st_size(&self) -> PyResult { Ok(self.metadata.len()) } #[getter] fn st_mtime(&self) -> PyResult { let modified = self .metadata .modified() .map_err(PyErr::new::)?; let since_epoch = modified .duration_since(std::time::UNIX_EPOCH) .map_err(|e| PyErr::new::(e.to_string()))?; Ok(since_epoch.as_secs()) } #[getter] fn st_ctime(&self) -> PyResult { let created = self .metadata .created() .map_err(PyErr::new::)?; let since_epoch = created .duration_since(std::time::UNIX_EPOCH) .map_err(|e| PyErr::new::(e.to_string()))?; Ok(since_epoch.as_secs()) } #[cfg(unix)] #[getter] fn st_mode(&self) -> PyResult { Ok(self.metadata.permissions().mode()) } #[cfg(not(unix))] #[getter] fn st_mode(&self) -> PyResult { Ok(0) } #[cfg(unix)] #[getter] fn st_dev(&self) -> PyResult { Ok(self.metadata.dev()) } #[cfg(unix)] #[getter] fn st_ino(&self) -> PyResult { Ok(self.metadata.ino()) } } #[pyclass] struct SHA1Provider { provider: Box, } #[pymethods] impl SHA1Provider { fn sha1<'a>(&mut self, py: Python<'a>, path: &Bound) -> PyResult> { let path = extract_path(path)?; let sha1 = self .provider .sha1(&path) .map_err(PyErr::new::)?; Ok(PyBytes::new(py, sha1.as_bytes())) } fn stat_and_sha1<'a>( &mut self, py: Python<'a>, path: &Bound, ) -> PyResult<(Py, Bound<'a, PyBytes>)> { let path = extract_path(path)?; let (md, sha1) = self.provider.stat_and_sha1(&path)?; let pmd = StatResult { metadata: md }; Ok(( pmd.into_pyobject(py)?.unbind().into(), PyBytes::new(py, sha1.as_bytes()), )) } } #[pyfunction] fn DefaultSHA1Provider() -> PyResult { Ok(SHA1Provider { provider: Box::new(bazaar::dirstate::DefaultSHA1Provider::new()), }) } fn extract_fs_time(obj: &Bound) -> PyResult { if let Ok(u) = obj.extract::() { Ok(u) } else if let Ok(u) = obj.extract::() { Ok(u as u64) } else { Err(PyTypeError::new_err("Not a float or int")) } } #[pyfunction] fn pack_stat<'a>(stat_result: &'a Bound<'a, PyAny>) -> 
PyResult> { let size = stat_result.getattr("st_size")?.extract::()?; let mtime = extract_fs_time(&stat_result.getattr("st_mtime")?)?; let ctime = extract_fs_time(&stat_result.getattr("st_ctime")?)?; let dev = stat_result.getattr("st_dev")?.extract::()?; let ino = stat_result.getattr("st_ino")?.extract::()?; let mode = stat_result.getattr("st_mode")?.extract::()?; let s = bazaar::dirstate::pack_stat(size, mtime, ctime, dev, ino, mode); Ok(PyBytes::new(stat_result.py(), s.as_bytes())) } #[pyfunction] fn fields_per_entry(num_present_parents: usize) -> usize { bazaar::dirstate::fields_per_entry(num_present_parents) } #[pyfunction] fn get_ghosts_line(py: Python, ghost_ids: Vec>) -> PyResult> { let ghost_ids = ghost_ids .iter() .map(|x| x.as_slice()) .collect::>(); let bs = bazaar::dirstate::get_ghosts_line(ghost_ids.as_slice()); Ok(PyBytes::new(py, bs.as_slice())) } #[pyfunction] fn get_parents_line(py: Python, parent_ids: Vec>) -> PyResult> { let parent_ids = parent_ids .iter() .map(|x| x.as_slice()) .collect::>(); let bs = bazaar::dirstate::get_parents_line(parent_ids.as_slice()); Ok(PyBytes::new(py, bs.as_slice())) } #[pyclass] struct IdIndex(bazaar::dirstate::IdIndex); #[pymethods] impl IdIndex { #[new] fn new() -> Self { IdIndex(bazaar::dirstate::IdIndex::new()) } fn add(&mut self, entry: (Vec, Vec, FileId)) -> PyResult<()> { self.0.add((&entry.0, &entry.1, &entry.2)); Ok(()) } fn remove(&mut self, entry: (Vec, Vec, FileId)) -> PyResult<()> { self.0.remove((&entry.0, &entry.1, &entry.2)); Ok(()) } fn get<'a>( &self, py: Python<'a>, file_id: FileId, ) -> PyResult, Bound<'a, PyBytes>, Bound<'a, PyBytes>)>> { let ret = self.0.get(&file_id); ret.iter() .map(|(a, b, c)| { Ok(( PyBytes::new(py, a), PyBytes::new(py, b), c.into_pyobject(py)?, )) }) .collect::>>() } fn iter_all<'py>( &self, py: Python<'py>, ) -> PyResult< Vec<( Bound<'py, PyBytes>, Bound<'py, PyBytes>, Bound<'py, PyBytes>, )>, > { let ret = self.0.iter_all(); ret.map(|(a, b, c)| { Ok(( PyBytes::new(py, a), 
PyBytes::new(py, b), c.into_pyobject(py)?, )) }) .collect::>>() } fn file_ids<'a>(&self, py: Python<'a>) -> PyResult>> { self.0.file_ids().map(|x| x.into_pyobject(py)).collect() } } #[pyfunction] fn inv_entry_to_details<'a>( py: Python<'a>, e: &'a crate::inventory::InventoryEntry, ) -> ( Bound<'a, PyBytes>, Bound<'a, PyBytes>, u64, bool, Bound<'a, PyBytes>, ) { let ret = bazaar::dirstate::inv_entry_to_details(&e.0); ( PyBytes::new(py, &[ret.0]), PyBytes::new(py, ret.1.as_slice()), ret.2, ret.3, PyBytes::new(py, ret.4.as_slice()), ) } #[pyfunction] fn get_output_lines(py: Python<'_>, lines: Vec>) -> Vec> { let lines = lines.iter().map(|x| x.as_slice()).collect::>(); bazaar::dirstate::get_output_lines(lines) .into_iter() .map(|x| PyBytes::new(py, x.as_slice())) .collect() } /// Helpers for the dirstate module. pub fn _dirstate_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "dirstate")?; m.add_wrapped(wrap_pyfunction!(lt_by_dirs))?; m.add_wrapped(wrap_pyfunction!(bisect_path_left))?; m.add_wrapped(wrap_pyfunction!(bisect_path_right))?; m.add_wrapped(wrap_pyfunction!(lt_path_by_dirblock))?; m.add_wrapped(wrap_pyfunction!(bisect_dirblock))?; m.add_wrapped(wrap_pyfunction!(DefaultSHA1Provider))?; m.add_wrapped(wrap_pyfunction!(pack_stat))?; m.add_wrapped(wrap_pyfunction!(fields_per_entry))?; m.add_wrapped(wrap_pyfunction!(get_ghosts_line))?; m.add_wrapped(wrap_pyfunction!(get_parents_line))?; m.add_class::()?; m.add_wrapped(wrap_pyfunction!(inv_entry_to_details))?; m.add_wrapped(wrap_pyfunction!(get_output_lines))?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar-py/src/groupcompress.rs0000644000000000000000000003335315162074037021106 0ustar00use bazaar::groupcompress::compressor::GroupCompressor; use bazaar::versionedfile::Key; use pyo3::exceptions::{PyRuntimeError, PyValueError}; use pyo3::prelude::*; use pyo3::types::PyBytes; use pyo3::wrap_pyfunction; use std::borrow::Cow; use std::convert::TryInto; #[pyfunction] fn encode_base128_int(py: Python, value: u128) -> 
PyResult> { let ret = bazaar::groupcompress::delta::encode_base128_int(value); Ok(PyBytes::new(py, &ret)) } #[pyfunction] fn decode_base128_int(value: Vec) -> PyResult<(u128, usize)> { Ok(bazaar::groupcompress::delta::decode_base128_int(&value)) } #[pyfunction] fn apply_delta(py: Python, basis: Vec, delta: Vec) -> PyResult> { bazaar::groupcompress::delta::apply_delta(&basis, &delta) .map_err(|e| PyErr::new::(format!("Invalid delta: {}", e))) .map(|x| PyBytes::new(py, &x)) } #[pyfunction] fn decode_copy_instruction(data: Vec, cmd: u8, pos: usize) -> PyResult<(usize, usize, usize)> { let ret = bazaar::groupcompress::delta::decode_copy_instruction(&data, cmd, pos); if ret.is_err() { return Err(PyErr::new::( "Invalid copy instruction", )); } let ret = ret.unwrap(); Ok((ret.0, ret.1, ret.2)) } #[pyfunction] #[pyo3(signature = (source, delta_start, delta_end))] fn apply_delta_to_source<'a>( py: Python<'a>, source: &'a [u8], delta_start: usize, delta_end: usize, ) -> PyResult> { bazaar::groupcompress::delta::apply_delta_to_source(source, delta_start, delta_end) .map_err(|e| PyErr::new::(format!("Invalid delta: {}", e))) .map(|x| PyBytes::new(py, &x)) } #[pyfunction] fn encode_copy_instruction(py: Python, offset: usize, length: usize) -> PyResult> { let ret = bazaar::groupcompress::delta::encode_copy_instruction(offset, length); Ok(PyBytes::new(py, &ret)) } #[pyfunction] fn make_line_delta<'a>( py: Python<'a>, source_bytes: &'a [u8], target_bytes: &'a [u8], ) -> Bound<'a, PyBytes> { PyBytes::new( py, bazaar::groupcompress::line_delta::make_delta(source_bytes, target_bytes) .flat_map(|x| x.into_owned()) .collect::>() .as_slice(), ) } #[pyfunction] fn make_rabin_delta<'a>( py: Python<'a>, source_bytes: &'a [u8], target_bytes: &'a [u8], ) -> Bound<'a, PyBytes> { PyBytes::new( py, bazaar::groupcompress::rabin_delta::make_delta(source_bytes, target_bytes).as_slice(), ) } #[pyclass] pub struct LinesDeltaIndex(bazaar::groupcompress::line_delta::LinesDeltaIndex); #[pymethods] impl 
LinesDeltaIndex { #[new] fn new(lines: Vec>) -> Self { let index = bazaar::groupcompress::line_delta::LinesDeltaIndex::new(lines); Self(index) } #[getter] fn lines<'a>(&self, py: Python<'a>) -> Vec> { self.0 .lines() .iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect() } #[pyo3(signature = (source, bytes_length, soft = None))] fn make_delta<'a>( &'a self, py: Python<'a>, source: Vec>>, bytes_length: usize, soft: Option, ) -> (Vec>, Vec) { let source: Vec> = source .iter() .map(|x| Cow::Owned(x.iter().flatten().copied().collect::>())) .collect::>(); let (delta, index) = self.0.make_delta(source.as_slice(), bytes_length, soft); ( delta .into_iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect(), index, ) } fn extend_lines(&mut self, lines: Vec>, index: Vec) -> PyResult<()> { self.0.extend_lines(lines.as_slice(), index.as_slice()); Ok(()) } #[getter] fn endpoint(&self) -> usize { self.0.endpoint() } } #[pyclass(unsendable)] struct GroupCompressBlock(bazaar::groupcompress::block::GroupCompressBlock); #[pymethods] impl GroupCompressBlock { #[new] fn new() -> Self { Self(bazaar::groupcompress::block::GroupCompressBlock::new()) } fn __len__(&self) -> usize { self.0.len() } #[getter] fn _z_content<'a>(&mut self, py: Python<'a>) -> PyResult> { let ret = self.0.z_content(); Ok(PyBytes::new(py, &ret)) } #[getter] fn _content<'a>(&mut self, py: Python<'a>) -> PyResult>> { let ret = self.0.content(); Ok(ret.map(|x| PyBytes::new(py, x))) } #[getter] fn _content_length(&self) -> Option { self.0.content_length() } #[classmethod] fn from_bytes(_type: &pyo3::Bound, data: &[u8]) -> PyResult { let ret = bazaar::groupcompress::block::GroupCompressBlock::from_bytes(data); if ret.is_err() { return Err(PyErr::new::( "Invalid block", )); } Ok(Self(ret.unwrap())) } fn extract<'a>( &mut self, py: Python<'a>, _key: Py, offset: usize, length: usize, ) -> PyResult>> { let chunks = self .0 .extract(offset, length) .map_err(|e| PyValueError::new_err(format!("Error during extract: {:?}", 
e)))?; Ok(chunks .into_iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect()) } fn set_chunked_content(&mut self, data: Vec>, length: usize) -> PyResult<()> { self.0.set_chunked_content(data.as_slice(), length); Ok(()) } #[pyo3(signature = (kind = None))] fn to_chunks<'a>( &mut self, py: Python<'a>, kind: Option, ) -> (usize, Vec>) { let (size, chunks) = self.0.to_chunks(kind); let chunks = chunks .into_iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect(); (size, chunks) } fn to_bytes<'a>(&mut self, py: Python<'a>) -> PyResult> { let ret = self.0.to_bytes(); Ok(PyBytes::new(py, &ret)) } #[pyo3(signature = (size = None))] fn _ensure_content(&mut self, size: Option) -> PyResult<()> { self.0.ensure_content(size); Ok(()) } #[pyo3(signature = (include_text = None))] fn _dump(&mut self, py: Python, include_text: Option) -> PyResult> { let ret = self .0 .dump(include_text) .map_err(|e| PyValueError::new_err(format!("Error during dump: {:?}", e)))?; Ok(ret .into_iter() .map(|x| match x { bazaar::groupcompress::block::DumpInfo::Fulltext(text) => ( PyBytes::new(py, b"f"), 0usize, text.map(|x| PyBytes::new(py, x.as_ref()).unbind().into()), ), bazaar::groupcompress::block::DumpInfo::Delta(decomp_len, info) => ( PyBytes::new(py, b"d"), decomp_len, Some( info.into_iter() .map(|x| match x { bazaar::groupcompress::block::DeltaInfo::Copy( offset, len, text, ) => ( offset, len, text.map(|x| PyBytes::new(py, x.as_ref()).unbind()), ) .into_pyobject(py) .unwrap() .unbind(), bazaar::groupcompress::block::DeltaInfo::Insert(len, data) => ( 0usize, len, data.map(|x| PyBytes::new(py, x.as_slice()).unbind()), ) .into_pyobject(py) .unwrap() .unbind(), }) .collect::>() .into_pyobject(py) .unwrap() .unbind(), ), ), }) .collect::>() .into_pyobject(py)? 
.unbind()) } } #[pyclass] struct TraditionalGroupCompressor( Option, ); #[pymethods] impl TraditionalGroupCompressor { #[new] #[allow(unused_variables)] #[pyo3(signature = (settings = None))] fn new(settings: Option>) -> Self { Self(Some( bazaar::groupcompress::compressor::TraditionalGroupCompressor::new(), )) } #[getter] fn chunks<'a>(&self, py: Python<'a>) -> PyResult>> { if let Some(c) = self.0.as_ref() { Ok(c.chunks() .iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect()) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } #[getter] fn endpoint(&self) -> PyResult { if let Some(c) = self.0.as_ref() { Ok(c.endpoint()) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } fn ratio(&self) -> PyResult { if let Some(c) = self.0.as_ref() { Ok(c.ratio()) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } fn extract<'a>( &self, py: Python<'a>, key: Vec>, ) -> PyResult<(Vec>, Bound<'a, PyBytes>)> { if let Some(c) = self.0.as_ref() { let (data, hash) = c .extract(&key) .map_err(|e| PyValueError::new_err(format!("Error during extract: {:?}", e)))?; Ok(( data.iter().map(|x| PyBytes::new(py, x.as_ref())).collect(), PyBytes::new(py, hash.as_bytes()), )) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } fn flush<'a>(&mut self, py: Python<'a>) -> PyResult<(Vec>, usize)> { if let Some(c) = self.0.take() { let (chunks, endpoint) = c.flush(); Ok(( chunks .into_iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect(), endpoint, )) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } fn flush_without_last<'a>( &mut self, py: Python<'a>, ) -> PyResult<(Vec>, usize)> { if let Some(c) = self.0.take() { let (chunks, endpoint) = c.flush_without_last(); Ok(( chunks .into_iter() .map(|x| PyBytes::new(py, x.as_ref())) .collect(), endpoint, )) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } #[pyo3(signature = (key, chunks, length, 
expected_sha = None, nostore_sha = None, soft = None))] fn compress<'a>( &mut self, py: Python<'a>, key: Key, chunks: Vec>, length: usize, expected_sha: Option, nostore_sha: Option, soft: Option, ) -> PyResult<(Bound<'a, PyBytes>, usize, usize, &'a str)> { let chunks_l = chunks.iter().map(|x| x.as_slice()).collect::>(); if let Some(c) = self.0.as_mut() { c.compress( &key, chunks_l.as_slice(), length, expected_sha, nostore_sha, soft, ) .map_err(|e| PyValueError::new_err(format!("Error during compress: {:?}", e))) .map(|(hash, size, chunks, kind)| (PyBytes::new(py, hash.as_ref()), size, chunks, kind)) } else { Err(PyRuntimeError::new_err("Compressor is already finalized")) } } } #[pyfunction] fn rabin_hash(data: Vec) -> PyResult { Ok(bazaar::groupcompress::rabin_delta::rabin_hash( data.try_into() .map_err(|e| PyValueError::new_err(format!("Error during rabin_hash: {:?}", e)))?, ) .into()) } pub(crate) fn _groupcompress_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "groupcompress")?; m.add_wrapped(wrap_pyfunction!(encode_base128_int))?; m.add_wrapped(wrap_pyfunction!(decode_base128_int))?; m.add_wrapped(wrap_pyfunction!(apply_delta))?; m.add_wrapped(wrap_pyfunction!(decode_copy_instruction))?; m.add_wrapped(wrap_pyfunction!(encode_copy_instruction))?; m.add_wrapped(wrap_pyfunction!(apply_delta_to_source))?; m.add_wrapped(wrap_pyfunction!(make_line_delta))?; m.add_wrapped(wrap_pyfunction!(make_rabin_delta))?; m.add_wrapped(wrap_pyfunction!(rabin_hash))?; m.add_class::()?; m.add_class::()?; m.add( "NULL_SHA1", pyo3::types::PyBytes::new(py, &bazaar::groupcompress::NULL_SHA1), )?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar-py/src/hashcache.rs0000644000000000000000000001515415162212430020073 0ustar00use bazaar::filters::ContentFilter; use pyo3::prelude::*; use pyo3::types::PyBytes; use std::fs::Permissions; use std::io::Error; #[cfg(unix)] use std::os::unix::fs::PermissionsExt; use std::path::Path; #[pyclass] struct HashCache { hashcache: Box, } pub struct 
PyContentFilter { content_filter: Py, } #[pyclass] struct PyChunkIterator { input: Box, Error>> + Send + Sync>, } #[pymethods] impl PyChunkIterator { fn __next__<'py>(&mut self, py: Python<'py>) -> PyResult>> { match self.input.next() { Some(Ok(item)) => Ok(Some(PyBytes::new(py, &item))), Some(Err(e)) => Err(e.into()), None => Ok(None), } } } fn map_py_err_to_io_err(e: PyErr) -> Error { Error::new(std::io::ErrorKind::Other, e.to_string()) } fn map_py_err_to_iter_io_err( e: PyErr, ) -> Box, Error>> + Send + Sync> { Box::new(std::iter::once(Err(map_py_err_to_io_err(e)))) } impl PyContentFilter { fn _impl( &self, input: Box, Error>> + Send + Sync>, worker: &str, ) -> Box, Error>> + Send + Sync> { Python::attach(|py| { let worker = self.content_filter.getattr(py, worker); let py_input = PyChunkIterator { input }; let py_output = worker.unwrap().call1(py, (py_input,)); if let Err(e) = py_output { return map_py_err_to_iter_io_err(e); } let py_output = py_output.unwrap(); let next = move || { Python::attach(|py| { let item = py_output.call_method0(py, "__next__"); match item { Err(e) => Some(Err(map_py_err_to_io_err(e))), Ok(item) => { if item.is_none(py) { None } else { Some(Ok(item.extract(py).map_err(map_py_err_to_io_err).unwrap())) } } } }) }; Box::new(std::iter::from_fn(next)) }) } } impl ContentFilter for PyContentFilter { fn reader( &self, input: Box, Error>> + Send + Sync>, ) -> Box, Error>> + Send + Sync> { self._impl(input, "reader") } fn writer( &self, input: Box, Error>> + Send + Sync>, ) -> Box, Error>> + Send + Sync> { self._impl(input, "writer") } } fn content_filter_to_fn( content_filter_provider: Py, ) -> Box Box + Send + Sync> { Box::new(move |path, ctime| { Python::attach(|py| { Box::new(PyContentFilter { content_filter: content_filter_provider.call1(py, (path, ctime)).unwrap(), }) }) }) } fn extract_fs_time(obj: &Bound) -> Result { if let Ok(val) = obj.extract::() { Ok(val) } else if let Ok(val) = obj.extract::() { Ok(val as i64) } else { 
Err(PyErr::new::( "Expected int or float", )) } } #[pymethods] impl HashCache { #[new] #[pyo3(signature = ( root, cache_file_name, mode = None, content_filter_provider = None ))] fn new( root: &str, cache_file_name: &str, mode: Option, content_filter_provider: Option>, ) -> Self { Self { hashcache: Box::new(bazaar::hashcache::HashCache::new( Path::new(root), Path::new(cache_file_name), { #[cfg(unix)] { mode.map(Permissions::from_mode) } #[cfg(not(unix))] { let _ = mode; None } }, content_filter_provider.map(content_filter_to_fn), )), } } fn cache_file_name(&self) -> &str { self.hashcache.cache_file_name().to_str().unwrap() } fn clear(&mut self) { self.hashcache.clear(); } fn scan(&mut self) { self.hashcache.scan(); } #[pyo3(signature = (path, stat_value = None))] fn get_sha1<'a>( &mut self, py: Python<'a>, path: &str, stat_value: Option>, ) -> PyResult> { let sha1; if let Some(stat_value) = stat_value { let fp = bazaar::hashcache::Fingerprint { size: stat_value.getattr("st_size")?.extract()?, mtime: extract_fs_time(&stat_value.getattr("st_mtime")?)?, ctime: extract_fs_time(&stat_value.getattr("st_ctime")?)?, ino: stat_value.getattr("st_ino")?.extract()?, dev: stat_value.getattr("st_dev")?.extract()?, mode: stat_value.getattr("st_mode")?.extract()?, }; sha1 = self .hashcache .get_sha1_by_fingerprint(Path::new(path), &fp)?; } else { let ret = self.hashcache.get_sha1(Path::new(path), None)?; if let Some(s) = ret { sha1 = s; } else { return Ok(py.None().into_bound(py)); } } Ok(PyBytes::new(py, sha1.as_bytes()).into_any()) } fn write(&mut self) -> PyResult<()> { self.hashcache.write().map_err(|e| e.into()) } fn read(&mut self) -> PyResult<()> { self.hashcache.read().map_err(|e| e.into()) } fn cutoff_time(&self) -> i64 { self.hashcache.cutoff_time() } fn set_cutoff_offset(&mut self, offset: i64) { self.hashcache.set_cutoff_offset(offset); } #[getter] fn miss_count(&self) -> u32 { self.hashcache.miss_count() } #[getter] fn hit_count(&self) -> u32 { 
self.hashcache.hit_count() } #[getter] fn needs_write(&self) -> bool { self.hashcache.needs_write() } fn fingerprint(&self, abspath: &str) -> Option<(u64, i64, i64, u64, u64, u32)> { let fp = self.hashcache.fingerprint(Path::new(abspath), None); fp.map(|fp| (fp.size, fp.mtime, fp.ctime, fp.ino, fp.dev, fp.mode)) } } pub(crate) fn hashcache(m: &Bound) -> PyResult<()> { m.add_class::()?; Ok(()) } bzrformats_3.4.0.orig/crates/bazaar-py/src/inventory.rs0000644000000000000000000016625715162074037020245 0ustar00use bazaar::inventory::{describe_change, detect_changes, Entry, Error, Inventory as _}; use bazaar::inventory_delta::{ InventoryDeltaEntry, InventoryDeltaInconsistency, InventoryDeltaParseError, InventoryDeltaSerializeError, }; use bazaar::{FileId, RevisionId}; use osutils::Kind; use pyo3::class::basic::CompareOp; use pyo3::exceptions::{ PyIndexError, PyKeyError, PyNotImplementedError, PyTypeError, PyValueError, }; use pyo3::prelude::*; use pyo3::pyclass_init::PyClassInitializer; use pyo3::types::{PyBytes, PyDict, PyString}; use pyo3::wrap_pyfunction; use pyo3::{create_exception, import_exception}; use std::collections::HashMap; use std::collections::HashSet; use std::collections::VecDeque; use std::iter::FromIterator; import_exception!(bzrformats.inventory, InvalidEntryName); import_exception!(bzrformats.inventory, DuplicateFileId); import_exception!(bzrformats.inventory, NoSuchId); import_exception!(bzrformats.errors, BzrCheckError); import_exception!(bzrformats.errors, InvalidNormalization); import_exception!(bzrformats.errors, InconsistentDelta); import_exception!(bzrformats.errors, AlreadyVersionedError); import_exception!(bzrformats.errors, BzrFormatsError); import_exception!(bzrformats.errors, NotADirectory); import_exception!(bzrformats.errors, NotVersionedError); create_exception!( bzrformats.inventory_delta, IncompatibleInventoryDelta, BzrFormatsError ); create_exception!(bzrformats.inventory_delta, InventoryDeltaError, BzrFormatsError); fn 
kind_from_str(kind: &str) -> Option { match kind { "file" => Some(Kind::File), "directory" => Some(Kind::Directory), "tree-reference" => Some(Kind::TreeReference), "symlink" => Some(Kind::Symlink), _ => None, } } fn check_name(name: &str) -> PyResult<()> { if !is_valid_name(name) { Err(InvalidEntryName::new_err((name.to_string(),))) } else { Ok(()) } } fn common_ie_check( slf: Py, ie: &Entry, py: Python, checker: &Py, rev_id: &RevisionId, inv: Py, ) -> PyResult<()> { if let Some(parent_id) = ie.parent_id() { let present = inv .call_method1(py, "has_id", (parent_id,))? .extract::(py)?; if !present { return Err(BzrCheckError::new_err(format!( "missing parent {{{}}} in inventory for revision {{{}}}", parent_id, rev_id ))); } } checker.call_method1(py, "_add_entry_to_text_key_references", (inv, slf))?; Ok(()) } #[pyclass(subclass)] pub struct InventoryEntry(pub Entry); #[pymethods] impl InventoryEntry { fn has_text(&self) -> bool { matches!(&self.0, Entry::File { .. }) } fn kind_character(&self) -> &'static str { self.0.kind().marker() } #[getter] fn kind(&self) -> &'static str { self.0.kind().to_string() } #[getter] fn get_name(&self) -> &str { match &self.0 { Entry::File { name, .. } => name, Entry::Directory { name, .. } => name, Entry::TreeReference { name, .. } => name, Entry::Link { name, .. } => name, Entry::Root { .. 
} => "", } } #[getter] fn get_file_id<'a>(&self, py: Python<'a>) -> PyResult> { let file_id = self.0.file_id(); file_id.into_pyobject(py) } #[getter] fn get_parent_id<'py>(&self, py: Python<'py>) -> Option> { let parent_id = self.0.parent_id(); parent_id.map(|parent_id| parent_id.into_pyobject(py).unwrap()) } #[getter] fn get_revision<'py>(&self, py: Python<'py>) -> Option> { let revision = self.0.revision(); revision .as_ref() .map(|revision| revision.into_pyobject(py).unwrap()) } #[staticmethod] fn versionable_kind(kind: &str) -> bool { if let Some(kind) = kind_from_str(kind) { bazaar::inventory::versionable_kind(kind) } else { false } } #[getter] fn get_executable(&self) -> bool { match &self.0 { Entry::File { executable, .. } => *executable, _ => false, } } fn is_unmodified(&self, other: &InventoryEntry) -> bool { self.0.is_unmodified(&other.0) } fn detect_changes(&self, other: &InventoryEntry) -> (bool, bool) { detect_changes(&self.0, &other.0) } #[staticmethod] #[pyo3(signature = (slf=None, other=None))] fn describe_change(slf: Option<&InventoryEntry>, other: Option<&InventoryEntry>) -> String { describe_change(slf.map(|s| &s.0), other.map(|o| &o.0)).to_string() } fn __richcmp__(&self, other: &InventoryEntry, op: CompareOp) -> PyResult { match op { CompareOp::Eq => Ok(self.0 == other.0), CompareOp::Ne => Ok(self.0 != other.0), _ => Err(PyNotImplementedError::new_err("")), } } fn _unchanged(&self, other: &InventoryEntry) -> bool { self.0.unchanged(&other.0) } #[pyo3(signature = (revision=None, name=None, parent_id=None))] fn derive( &self, revision: Option, name: Option, parent_id: Option, ) -> InventoryEntry { let mut entry = self.0.clone(); let revision = revision.or_else(|| entry.revision().cloned()); let name = name.unwrap_or_else(|| entry.name().to_string()); let parent_id = parent_id.or_else(|| entry.parent_id().cloned()); match &mut entry { Entry::File { revision: r, name: n, parent_id: p, .. 
            } => {
                *r = revision;
                *n = name;
                *p = parent_id.unwrap();
            }
            Entry::Directory {
                revision: r,
                name: n,
                parent_id: p,
                ..
            } => {
                *r = revision;
                *n = name;
                *p = parent_id.unwrap();
            }
            Entry::TreeReference {
                revision: r,
                name: n,
                parent_id: p,
                ..
            } => {
                *r = revision;
                *n = name;
                *p = parent_id.unwrap();
            }
            Entry::Link {
                revision: r,
                name: n,
                parent_id: p,
                ..
            } => {
                *r = revision;
                *n = name;
                *p = parent_id.unwrap();
            }
            Entry::Root { revision: r, .. } => {
                *r = revision;
            }
        }
        InventoryEntry(entry)
    }

    /// Find possible per-file graph parents.
    ///
    /// This is currently defined by:
    /// Select the last changed revision in the parent inventory.
    /// To deal with a short lived bug in bzr 0.8's development two entries
    /// that have the same last changed but different 'x' bit settings are
    /// changed in-place.
    fn parent_candidates<'py>(
        &self,
        py: Python<'py>,
        previous_inventories: Vec>,
    ) -> PyResult> {
        // revision:ie mapping for each ie found in previous_inventories
        let mut candidates: HashMap<&RevisionId, Py> = HashMap::new();
        // identify candidate head revision ids
        for inv in previous_inventories {
            match inv.call_method1(py, "get_entry", (self.get_file_id(py)?,)) {
                Ok(py_entry) => {
                    if let Ok(mut entry) = py_entry.extract::>(py) {
                        if let Some(revision) = entry.0.revision() {
                            if let Some(candidate) = candidates.get_mut(revision) {
                                // same revision value in two different inventories:
                                // correct possible inconsistencies:
                                // * there was a bug in revision updates with executable bit support
                                let mut candidate = candidate.extract::>(py)?;
                                if let (
                                    Entry::File {
                                        executable: candidate_executable,
                                        ..
                                    },
                                    Entry::File {
                                        executable: entry_executable,
                                        ..
                                    },
                                ) = (&mut candidate.0, &mut entry.0)
                                {
                                    if candidate_executable != entry_executable {
                                        *entry_executable = false;
                                        *candidate_executable = false;
                                    }
                                }
                            } else {
                                // add this revision as a candidate.
                                //candidates.insert(revision, py_entry);
                            }
                        }
                    }
                }
                Err(e) if e.is_instance_of::(py) => {}
                Err(e) => {
                    return Err(e);
                }
            }
        }
        let ret = PyDict::new(py);
        for (revision, entry) in candidates.into_iter() {
            ret.set_item(revision, entry)?;
        }
        Ok(ret)
    }
}

#[pyclass(subclass,extends=InventoryEntry)]
struct InventoryFile();

#[pymethods]
impl InventoryFile {
    #[new]
    #[pyo3(signature = (file_id, name, parent_id, revision=None, text_sha1=None, text_size=None, executable=None, text_id=None))]
    fn new(
        file_id: FileId,
        name: String,
        parent_id: FileId,
        revision: Option,
        text_sha1: Option>,
        text_size: Option,
        executable: Option,
        text_id: Option>,
    ) -> PyResult<(Self, InventoryEntry)> {
        let executable = executable.unwrap_or(false);
        check_name(name.as_str())?;
        let entry = Entry::File {
            file_id,
            name,
            parent_id,
            revision,
            text_sha1,
            text_size,
            text_id,
            executable,
        };
        Ok((Self(), InventoryEntry(entry)))
    }

    #[getter]
    fn get_executable(slf: PyRef) -> bool {
        match slf.into_super().0 {
            Entry::File { executable, .. } => executable,
            _ => false,
        }
    }

    #[getter]
    fn get_text_sha1(slf: PyRef, py: Python) -> Option> {
        let s = slf.into_super();
        match &s.0 {
            Entry::File { text_sha1, .. } => text_sha1
                .as_ref()
                .map(|text_sha1| PyBytes::new(py, text_sha1.as_ref()).into()),
            _ => panic!("Not a file"),
        }
    }

    #[getter]
    fn get_text_size(slf: PyRef) -> Option {
        let s = slf.into_super();
        match &s.0 {
            Entry::File { text_size, .. } => *text_size,
            _ => panic!("Not a file"),
        }
    }

    #[getter]
    fn get_text_id(slf: PyRef, py: Python) -> Option> {
        let s = slf.into_super();
        match &s.0 {
            Entry::File { text_id, ..
} => text_id .as_ref() .map(|text_id| PyBytes::new(py, text_id).into()), _ => panic!("Not a file"), } } #[getter] fn get_reference_revision(_slf: PyRef, py: Python) -> Py { py.None() } fn copy<'a>(slf: PyRef<'a, Self>, py: Python<'a>) -> PyResult> { let s = slf.into_super(); let init = PyClassInitializer::from(InventoryEntry(s.0.clone())); let init = init.add_subclass(Self()); Bound::new(py, init) } fn __repr__(slf: PyRef, py: Python) -> PyResult { let s = slf.into_super(); Ok(match &s.0 { Entry::File { name, file_id, parent_id, text_sha1, text_size, revision, .. } => format!( "InventoryFile({}, {}, parent_id={}, sha1={}, len={}, revision={})", file_id.into_pyobject(py).unwrap().repr()?, name.into_pyobject(py).unwrap().repr()?, parent_id.into_pyobject(py).unwrap().repr()?, text_sha1 .as_ref() .map(|s| PyBytes::new(py, s.as_slice()).repr()) .unwrap_or_else(|| Ok(PyString::new(py, "None")))?, text_size.into_pyobject(py).unwrap().repr()?, revision .as_ref() .map(|r| r.into_pyobject(py).unwrap()) .into_pyobject(py) .unwrap() .repr()?, ), _ => panic!("Not a file"), }) } fn check( slf: &Bound, py: Python, checker: Py, rev_id: RevisionId, inv: Py, ) -> PyResult<()> { let spr = slf.borrow().into_super(); common_ie_check( slf.clone().unbind().into(), &spr.0, py, &checker, &rev_id, inv, )?; let (file_id, revision, text_sha1, text_size) = match spr.0 { Entry::File { ref text_sha1, ref file_id, ref revision, text_size, .. 
} => (file_id, revision, text_sha1, text_size), _ => panic!("Not a file"), }; checker.call_method1( py, "add_pending_item", ( &rev_id, ("texts", &file_id, &revision), PyBytes::new(py, b"text"), PyBytes::new(py, text_sha1.as_ref().unwrap()), ), )?; if text_size.is_none() { checker.getattr(py, "_report_items")?.call_method1( py, "append", (format!( "fileid {{{}}} in {{{}}} has None for text_size", file_id, rev_id ),), )?; } Ok(()) } } #[pyclass(subclass,extends=InventoryEntry)] struct InventoryDirectory(); #[pymethods] impl InventoryDirectory { #[new] #[pyo3(signature = (file_id, name, parent_id=None, revision=None))] fn new( file_id: FileId, name: String, parent_id: Option, revision: Option, ) -> PyResult<(Self, InventoryEntry)> { check_name(name.as_str())?; let entry = if let Some(parent_id) = parent_id { Entry::Directory { file_id, name, parent_id, revision, } } else { Entry::Root { file_id, revision } }; Ok((Self(), InventoryEntry(entry))) } fn copy<'py>(slf: PyRef, py: Python<'py>) -> PyResult> { let s = slf.into_super(); let init = PyClassInitializer::from(InventoryEntry(s.0.clone())); let init = init.add_subclass(Self()); Bound::new(py, init) } #[getter] fn get_text_size(&self, py: Python) -> Py { py.None() } #[getter] fn get_text_sha1(&self, py: Python) -> Py { py.None() } fn __repr__(slf: PyRef, py: Python) -> PyResult { let s = slf.into_super(); Ok(match &s.0 { Entry::Directory { name, file_id, parent_id, revision, .. } => format!( "InventoryDirectory({}, {}, parent_id={}, revision={})", file_id.into_pyobject(py).unwrap().repr()?, name.into_pyobject(py).unwrap().repr()?, parent_id.into_pyobject(py).unwrap().repr()?, revision.into_pyobject(py).unwrap().repr()?, ), Entry::Root { file_id, revision, .. 
            } => format!(
                "InventoryDirectory({}, \"\", parent_id=None, revision={})",
                file_id.into_pyobject(py).unwrap().repr()?,
                revision.into_pyobject(py).unwrap().repr()?,
            ),
            _ => panic!("Not a directory"),
        })
    }

    fn check(
        slf: &Bound,
        py: Python,
        checker: Py,
        rev_id: RevisionId,
        inv: Py,
    ) -> PyResult<()> {
        let spr = slf.borrow().into_super();
        common_ie_check(
            slf.clone().unbind().into(),
            &spr.0,
            py,
            &checker,
            &rev_id,
            inv,
        )?;

        // In non rich root repositories we do not expect a file graph for the
        // root.
        if spr.0.name().is_empty() && !checker.getattr(py, "rich_roots")?.extract::(py)? {
            return Ok(());
        }

        // Directories are stored as an empty file, but the file should exist
        // to provide a per-fileid log. The hash of every directory content is
        // "da..." below (the sha1sum of '').
        checker.call_method1(
            py,
            "add_pending_item",
            (
                &rev_id,
                ("texts", spr.0.file_id(), spr.0.revision()),
                PyBytes::new(py, b"text"),
                PyBytes::new(py, b"da39a3ee5e6b4b0d3255bfef95601890afd80709"),
            ),
        )?;
        Ok(())
    }
}

#[pyclass(subclass,extends=InventoryEntry)]
struct TreeReference();

#[pymethods]
impl TreeReference {
    #[new]
    #[pyo3(signature = (file_id, name, parent_id, revision=None, reference_revision=None))]
    fn new(
        file_id: FileId,
        name: String,
        parent_id: FileId,
        revision: Option,
        reference_revision: Option,
    ) -> PyResult<(Self, InventoryEntry)> {
        check_name(name.as_str())?;
        let entry = Entry::TreeReference {
            file_id,
            name,
            parent_id,
            revision,
            reference_revision,
        };
        Ok((Self(), InventoryEntry(entry)))
    }

    #[getter]
    fn get_reference_revision<'a>(
        slf: PyRef<'a, Self>,
        py: Python<'a>,
    ) -> Option> {
        let s = slf.into_super();
        match &s.0 {
            Entry::TreeReference {
                reference_revision, ..
            } => reference_revision
                .as_ref()
                .map(|reference_revision| reference_revision.into_pyobject(py).unwrap()),
            _ => panic!("Not a tree reference"),
        }
    }

    fn copy<'py>(slf: PyRef, py: Python<'py>) -> PyResult> {
        let s = slf.into_super();
        let init = PyClassInitializer::from(InventoryEntry(s.0.clone()));
        let init = init.add_subclass(Self());
        Bound::new(py, init)
    }
}

#[pyclass(subclass,extends=InventoryEntry)]
struct InventoryLink();

#[pymethods]
impl InventoryLink {
    #[new]
    #[pyo3(signature = (file_id, name, parent_id, revision=None, symlink_target=None))]
    fn new(
        file_id: FileId,
        name: String,
        parent_id: FileId,
        revision: Option,
        symlink_target: Option,
    ) -> PyResult<(Self, InventoryEntry)> {
        check_name(name.as_str())?;
        let entry = Entry::Link {
            file_id,
            name,
            parent_id,
            symlink_target,
            revision,
        };
        Ok((Self(), InventoryEntry(entry)))
    }

    #[getter]
    fn get_symlink_target(slf: PyRef) -> Option {
        let s = slf.into_super();
        match s.0 {
            Entry::Link {
                ref symlink_target, ..
            } => symlink_target.clone(),
            _ => panic!("Not a link"),
        }
    }

    fn copy<'py>(slf: PyRef, py: Python<'py>) -> PyResult> {
        let s = slf.into_super();
        let init = PyClassInitializer::from(InventoryEntry(s.0.clone()));
        let init = init.add_subclass(Self());
        Bound::new(py, init)
    }

    #[getter]
    fn get_text_size(&self, py: Python) -> Py {
        py.None()
    }

    #[getter]
    fn get_text_sha1(&self, py: Python) -> Py {
        py.None()
    }

    fn check(
        slf: &Bound,
        py: Python,
        checker: Py,
        rev_id: RevisionId,
        inv: Py,
    ) -> PyResult<()> {
        let spr = slf.borrow().into_super();
        common_ie_check(
            slf.clone().unbind().into(),
            &spr.0,
            py,
            &checker,
            &rev_id,
            inv,
        )?;

        if spr.0.symlink_target().is_none() {
            let report_items = checker.getattr(py, "_report_items")?;
            report_items.call_method1(
                py,
                "append",
                (format!(
                    "symlink {} has no target in revision {}",
                    spr.0.file_id(),
                    spr.0
                        .revision()
                        .map_or_else(|| String::from("None"), |p| p.to_string())
                ),),
            )?;
        }

        // Symlinks are stored as ''
        checker.call_method1(
            py,
            "add_pending_item",
            (
                &rev_id,
                ("texts", spr.0.file_id(),
spr.0.revision()), PyBytes::new(py, b"text"), PyBytes::new(py, b"da39a3ee5e6b4b0d3255bfef95601890afd80709"), ), )?; Ok(()) } } fn entry_to_py(py: Python, e: Entry) -> PyResult> { let kind = e.kind(); let init = PyClassInitializer::from(InventoryEntry(e)); match kind { Kind::File => { let init = init.add_subclass(InventoryFile()); Ok(Bound::new(py, init)?.into_any()) } Kind::Directory => { let init = init.add_subclass(InventoryDirectory()); Ok(Bound::new(py, init)?.into_any()) } Kind::TreeReference => { let init = init.add_subclass(TreeReference()); Ok(Bound::new(py, init)?.into_any()) } Kind::Symlink => { let init = init.add_subclass(InventoryLink()); Ok(Bound::new(py, init)?.into_any()) } } } fn entry_from_py(py: Python, obj: Py) -> PyResult { let kind = obj.getattr(py, "kind")?.extract::(py)?; let kind = match kind.as_str() { "file" => Kind::File, "directory" => Kind::Directory, "tree-reference" => Kind::TreeReference, "symlink" => Kind::Symlink, _ => panic!("Unknown kind"), }; let file_id = obj.getattr(py, "file_id")?.extract::>(py)?; let name = obj.getattr(py, "name")?.extract::(py)?; let parent_id = obj .getattr(py, "parent_id")? .extract::>(py)?; let revision = obj .getattr(py, "revision")? .extract::>(py)?; let executable = obj.getattr(py, "executable")?.extract::>(py)?; let text_id = obj.getattr(py, "text_id")?.extract::>>(py)?; let text_sha1 = obj .getattr(py, "text_sha1")? .extract::>>(py)?; let text_size = obj.getattr(py, "text_size")?.extract::>(py)?; let symlink_target = obj .getattr(py, "symlink_target")? .extract::>(py)?; let reference_revision = obj .getattr(py, "reference_revision")? 
.extract::>(py)?; let entry = bazaar::inventory::make_entry( kind, name, parent_id, file_id, revision, text_sha1, text_size, executable, text_id, symlink_target, reference_revision, ) .map_err(|e| inventory_err_to_py_err(e, py))?; Ok(entry) } #[pyfunction] #[allow(clippy::too_many_arguments)] #[pyo3(signature = (kind, name, parent_id=None, revision=None, file_id=None, text_sha1=None, text_size=None, executable=None, text_id=None, symlink_target=None, reference_revision=None))] fn make_entry<'a>( py: Python<'a>, kind: &'a str, name: &'a str, parent_id: Option, revision: Option, file_id: Option, text_sha1: Option>, text_size: Option, executable: Option, text_id: Option>, symlink_target: Option, reference_revision: Option, ) -> PyResult> { let kind = match kind { "file" => Kind::File, "directory" => Kind::Directory, "tree-reference" => Kind::TreeReference, "symlink" => Kind::Symlink, _ => panic!("Unknown kind"), }; entry_to_py( py, bazaar::inventory::make_entry( kind, name.to_string(), file_id, parent_id, revision, text_sha1, text_size, executable, text_id, symlink_target, reference_revision, ) .map_err(|e| inventory_err_to_py_err(e, py))?, ) } #[pyfunction] fn is_valid_name(name: &str) -> bool { bazaar::inventory::is_valid_name(name) } #[pyfunction] fn ensure_normalized_name(name: std::path::PathBuf) -> PyResult { let path = bazaar::inventory::ensure_normalized_name(name.as_path()) .map_err(|_e| InvalidNormalization::new_err(name.clone()))?; path.to_str().map(|s| s.to_string()).ok_or_else(|| { PyValueError::new_err(format!( "Invalid normalization for path: {}", name.display() )) }) } fn delta_err_to_py_err(py: Python, e: InventoryDeltaInconsistency) -> PyErr { match e { InventoryDeltaInconsistency::NoPath => { InconsistentDelta::new_err(("", "", "No path in entry")) } InventoryDeltaInconsistency::DuplicateFileId(ref path, ref fid) => { InconsistentDelta::new_err((path.clone(), fid.clone(), "repeated file_id")) } InventoryDeltaInconsistency::DuplicateOldPath(path, 
fid) => { InconsistentDelta::new_err((path, fid, "repeated path")) } InventoryDeltaInconsistency::DuplicateNewPath(path, fid) => { InconsistentDelta::new_err((path, fid, "repeated path")) } InventoryDeltaInconsistency::MismatchedId(path, fid1, fid2) => { InconsistentDelta::new_err((path, fid1, format!("mismatched id with entry {}", fid2))) } InventoryDeltaInconsistency::EntryWithoutPath(path, fid) => { InconsistentDelta::new_err((path, fid, "Entry with no new_path")) } InventoryDeltaInconsistency::PathWithoutEntry(path, fid) => { InconsistentDelta::new_err((path, fid, "new_path with no entry")) } InventoryDeltaInconsistency::OrphanedChild(fid) => { InconsistentDelta::new_err(("", fid, "orphaned child")) } InventoryDeltaInconsistency::NoSuchId(fid) => NoSuchId::new_err((py.None(), fid)), InventoryDeltaInconsistency::PathMismatch(fid, path1, path2) => { InconsistentDelta::new_err((path1, fid, format!("path mismatch != {}", path2))) } InventoryDeltaInconsistency::ParentMissing(fid) => { InconsistentDelta::new_err(("", fid, "parent missing")) } InventoryDeltaInconsistency::InvalidEntryName(name) => InvalidEntryName::new_err((name,)), InventoryDeltaInconsistency::FileIdCycle(fid, path, parent_path) => { InconsistentDelta::new_err((path, fid, format!("file_id cycle with {}", parent_path))) } InventoryDeltaInconsistency::ParentNotDirectory(path, fid) => { InconsistentDelta::new_err((path, fid, "parent is not a directory")) } InventoryDeltaInconsistency::PathAlreadyVersioned(name, parent_path) => { InconsistentDelta::new_err((name, parent_path, "path already versioned")) } } } #[pyclass] struct InventoryDelta(bazaar::inventory_delta::InventoryDelta); #[pymethods] impl InventoryDelta { #[new] #[allow(clippy::type_complexity)] #[pyo3(signature = (delta=None))] fn new( _py: Python, delta: Option< Vec<( Option, Option, FileId, Option>, )>, >, ) -> PyResult { let delta = delta.unwrap_or_default(); let delta = delta .into_iter() .map(|(old_name, new_name, file_id, entry)| { let 
old_name = old_name.as_deref(); let new_name = new_name.as_deref(); let entry = entry.as_ref().map(|e| e.0.clone()); InventoryDeltaEntry { old_path: old_name.map(|s| s.to_string()), new_path: new_name.map(|s| s.to_string()), file_id, new_entry: entry, } }) .collect::>(); Ok(Self(bazaar::inventory_delta::InventoryDelta::from(delta))) } fn __nonzero__(slf: PyRef) -> bool { !slf.0.is_empty() } fn sort(&mut self) { self.0.sort(); } fn __len__(&self) -> usize { self.0.len() } fn __richcmp__(&self, other: PyRef, op: CompareOp) -> PyResult> { match op { CompareOp::Eq => Ok(Some(self.0 == other.0)), CompareOp::Ne => Ok(Some(self.0 != other.0)), _ => Err(PyNotImplementedError::new_err( "Only == and != are supported", )), } } fn __getitem__<'a>( &self, py: Python<'a>, index: isize, ) -> PyResult<(Option, Option, FileId, Bound<'a, PyAny>)> { let index: usize = if index < 0 { (self.0.len() as isize + index) as usize } else { index as usize }; let entry = self .0 .get(index) .ok_or(PyIndexError::new_err("Index out of bounds"))?; Ok(( entry.old_path.clone(), entry.new_path.clone(), entry.file_id.clone(), entry.new_entry.as_ref().map_or_else( || Ok(py.None().into_bound(py)), |e| entry_to_py(py, e.clone()), )?, )) } fn check(&self, py: Python) -> PyResult<()> { self.0.check().map_err(|e| match e { InventoryDeltaInconsistency::NoPath => { InconsistentDelta::new_err(("", "", "No path in entry")) } InventoryDeltaInconsistency::DuplicateFileId(ref path, ref fid) => { InconsistentDelta::new_err((path.clone(), fid.clone(), "repeated file_id")) } InventoryDeltaInconsistency::DuplicateOldPath(path, fid) => { InconsistentDelta::new_err((path, fid, "repeated path")) } InventoryDeltaInconsistency::DuplicateNewPath(path, fid) => { InconsistentDelta::new_err((path, fid, "repeated path")) } InventoryDeltaInconsistency::MismatchedId(path, fid1, fid2) => { InconsistentDelta::new_err(( path, fid1, format!("mismatched id with entry {}", fid2), )) } InventoryDeltaInconsistency::PathMismatch(fid, 
path1, path2) => { InconsistentDelta::new_err(( path1, fid, format!("mismatched path with entry {}", path2), )) } InventoryDeltaInconsistency::OrphanedChild(fid) => { InconsistentDelta::new_err(("", fid, "orphaned child")) } InventoryDeltaInconsistency::ParentNotDirectory(path, fid) => { InconsistentDelta::new_err((path, fid, "parent not directory")) } InventoryDeltaInconsistency::ParentMissing(fid) => { InconsistentDelta::new_err(("", fid, "parent missing")) } InventoryDeltaInconsistency::NoSuchId(fid) => NoSuchId::new_err((py.None(), fid)), InventoryDeltaInconsistency::InvalidEntryName(n) => InvalidEntryName::new_err((n,)), InventoryDeltaInconsistency::FileIdCycle(fid, path, parent_path) => { InconsistentDelta::new_err(( path, fid, format!("file_id cycle with {}", parent_path), )) } InventoryDeltaInconsistency::PathAlreadyVersioned(path, fid) => { InconsistentDelta::new_err((path, fid, "path already versioned")) } InventoryDeltaInconsistency::EntryWithoutPath(path, fid) => { InconsistentDelta::new_err((path, fid, "Entry with no new_path")) } InventoryDeltaInconsistency::PathWithoutEntry(path, fid) => { InconsistentDelta::new_err((path, fid, "new_path with no entry")) } }) } fn __repr__(&self) -> String { format!("{:?}", self.0) } } fn inventory_err_to_py_err(e: Error, py: Python) -> PyErr { match e { Error::InvalidEntryName(name) => InvalidEntryName::new_err((name,)), Error::InvalidNormalization(n, _) => InvalidNormalization::new_err((n,)), Error::DuplicateFileId(fid, path) => DuplicateFileId::new_err((fid, path)), Error::NoSuchId(fid) => NoSuchId::new_err((py.None(), fid)), Error::ParentNotDirectory(path, fid) => { InconsistentDelta::new_err((path, fid, "parent not directory")) } Error::FileIdCycle(fid, path, parent_path) => { InconsistentDelta::new_err((path, fid, format!("file_id cycle with {}", parent_path))) } Error::ParentMissing(fid) => InconsistentDelta::new_err(("", fid, "parent missing")), Error::PathAlreadyVersioned(name, parent_path) => { 
AlreadyVersionedError::new_err(format!("{}/{}", parent_path, name)) } Error::ParentNotVersioned(path) => { NotVersionedError::new_err(format!("parent not versioned: {}", path)) } } } #[pyclass] struct Inventory(bazaar::inventory::MutableInventory); #[pymethods] impl Inventory { #[new] #[pyo3(signature = (root_id=b"TREE_ROOT".to_vec(), revision_id=None, root_revision=None))] fn new( root_id: Option>, revision_id: Option, root_revision: Option, ) -> PyResult { let root_id = root_id.map(bazaar::FileId::from); let mut inv = Inventory(bazaar::inventory::MutableInventory::new()); if let Some(root_id) = root_id { let root = bazaar::inventory::Entry::root(root_id, root_revision); inv.0.add(root).unwrap(); } else if root_revision.is_some() { return Err(PyTypeError::new_err("root_revision requires root_id")); } inv.0.revision_id = revision_id; Ok(inv) } #[getter] fn root<'py>(&self, py: Python<'py>) -> PyResult> { if let Some(root) = self.0.root() { entry_to_py(py, root.clone()) } else { Ok(py.None().into_bound(py)) } } fn add(&mut self, py: Python, entry: &InventoryEntry) -> PyResult<()> { self.0 .add(entry.0.clone()) .map_err(|e| inventory_err_to_py_err(e, py))?; Ok(()) } #[pyo3(signature = (relpath, kind, file_id=None, revision=None, text_sha1=None, text_size=None, executable=None, text_id=None, symlink_target=None, reference_revision=None))] fn add_path<'py>( &mut self, py: Python<'py>, relpath: &str, kind: osutils::Kind, file_id: Option, revision: Option, text_sha1: Option>, text_size: Option, executable: Option, text_id: Option>, symlink_target: Option, reference_revision: Option, ) -> PyResult> { let file_id = self .0 .add_path( relpath, kind, file_id, revision, text_sha1, text_size, executable, text_id, symlink_target, reference_revision, ) .map_err(|e| inventory_err_to_py_err(e, py))?; self.get_entry(py, file_id) } #[getter] fn get_revision_id(&self) -> Option { self.0.revision_id.as_ref().cloned() } #[setter] fn set_revision_id(&mut self, revision_id: Option) { 
self.0.revision_id = revision_id; } fn id2path(&self, py: Python, file_id: FileId) -> PyResult { self.0 .id2path(&file_id) .map_err(|e| inventory_err_to_py_err(e, py)) } fn path2id(&self, path: &str) -> Option { self.0.path2id(path).cloned() } fn is_root(&self, file_id: FileId) -> PyResult { Ok(self.0.is_root(file_id)) } fn has_filename(&self, name: &str) -> PyResult { Ok(self.0.has_filename(name)) } fn get_children<'py>( &self, py: Python<'py>, file_id: FileId, ) -> PyResult>> { let children = self.0.get_children(&file_id); if children.is_none() { return Err(NoSuchId::new_err((py.None(), file_id))); } let children = children.unwrap(); let mut result = HashMap::with_capacity(children.len()); for (name, child) in children { result.insert(name.to_string(), entry_to_py(py, child.clone())?); } Ok(result) } fn entries<'py>(&self, py: Python<'py>) -> PyResult)>> { let entries = self.0.entries(); let mut result = Vec::with_capacity(entries.len()); for (name, entry) in entries { result.push((name, entry_to_py(py, entry.clone())?)); } Ok(result) } fn rename_id(&mut self, py: Python, old_file_id: FileId, new_file_id: FileId) -> PyResult<()> { self.0 .rename_id(&old_file_id, &new_file_id) .map_err(|e| inventory_err_to_py_err(e, py)) } fn path2id_segments(&self, names: Vec) -> Option { let names = names.iter().map(|s| s.as_str()).collect::>(); self.0.path2id_segments(names.as_slice()).cloned() } fn filter(&self, py: Python, specific_fileids: HashSet) -> PyResult { let result = self .0 .filter(&specific_fileids.iter().collect()) .map_err(|e| inventory_err_to_py_err(e, py))?; Ok(Self(result)) } fn get_entry_by_path_partial<'py>( &self, py: Python<'py>, relpath: Py, ) -> PyResult<( Option>, Option>, Option>, )> { let ret = if let Ok(relpath) = relpath.extract::(py) { self.0.get_entry_by_path_partial(&relpath) } else if let Ok(segments) = relpath.extract::>(py) { let segments = segments.iter().map(|s| s.as_str()).collect::>(); self.0 
.get_entry_by_path_segments_partial(segments.as_slice()) } else { return Err(PyTypeError::new_err("expected str or list of str")); }; if let Some((e, segments, missing)) = ret { Ok(( Some(entry_to_py(py, e.clone())?), Some(segments), Some(missing), )) } else { Ok((None, None, None)) } } fn get_entry_by_path<'py>( &self, py: Python<'py>, relpath: Py, ) -> PyResult>> { if let Ok(relpath) = relpath.extract::(py) { Ok(self .0 .get_entry_by_path(&relpath) .map(|entry| entry_to_py(py, entry.clone()).unwrap())) } else if let Ok(segments) = relpath.extract::>(py) { let segments = segments.iter().map(|s| s.as_str()).collect::>(); Ok(self .0 .get_entry_by_path_segments(segments.as_slice()) .map(|entry| entry_to_py(py, entry.clone()).unwrap())) } else { Err(PyTypeError::new_err("expected str or list of str")) } } #[pyo3(signature = (delta))] fn apply_delta( &mut self, py: Python, delta: Vec<( Option, Option, FileId, Option>, )>, ) -> PyResult<()> { let delta = bazaar::inventory_delta::InventoryDelta::from_iter(delta.into_iter().map( |(old_name, new_name, file_id, entry)| InventoryDeltaEntry { old_path: old_name, new_path: new_name, file_id, new_entry: entry.map(|entry| entry.0.clone()), }, )); self.0 .apply_delta(&delta) .map_err(|e| delta_err_to_py_err(py, e)) } #[pyo3(signature = (delta, new_revision_id))] fn create_by_apply_delta( &self, py: Python, delta: Vec<( Option, Option, FileId, Option>, )>, new_revision_id: RevisionId, ) -> PyResult { let delta = bazaar::inventory_delta::InventoryDelta::from_iter(delta.into_iter().map( |(old_name, new_name, file_id, entry)| InventoryDeltaEntry { old_path: old_name, new_path: new_name, file_id, new_entry: entry.map(|entry| entry.0.clone()), }, )); let result = self .0 .create_by_apply_delta(&delta, new_revision_id) .map_err(|e| delta_err_to_py_err(py, e))?; Ok(Self(result)) } fn __len__(&self) -> usize { self.0.len() } fn get_entry<'py>(&self, py: Python<'py>, file_id: FileId) -> PyResult> { self.0 .get_entry(&file_id) .map(|entry| 
entry_to_py(py, entry.clone()).unwrap()) .ok_or_else(|| NoSuchId::new_err((py.None(), file_id))) } fn get_file_kind(&self, file_id: FileId) -> Option<&str> { self.0.get_file_kind(&file_id).map(|kind| kind.to_string()) } fn has_id(&self, file_id: FileId) -> bool { self.0.has_id(&file_id) } fn get_child<'py>( &self, py: Python<'py>, file_id: FileId, name: &str, ) -> Option> { self.0 .get_child(&file_id, name) .map(|entry| entry_to_py(py, entry.clone()).unwrap()) } fn delete(&mut self, py: Python, file_id: FileId) -> PyResult<()> { self.0 .delete(&file_id) .map_err(|e| inventory_err_to_py_err(e, py)) } fn _make_delta<'py>( &self, py: Python<'py>, old: &Inventory, ) -> PyResult> { let inventory_delta = self.0.make_delta(&old.0); Bound::new(py, InventoryDelta(inventory_delta)) } fn remove_recursive_id<'a>( &mut self, py: Python<'a>, file_id: FileId, ) -> PyResult>> { self.0 .remove_recursive_id(&file_id) .into_iter() .map(|entry| entry_to_py(py, entry)) .collect::>>() } fn rename( &mut self, py: Python, file_id: FileId, new_parent_id: FileId, new_name: &str, ) -> PyResult<()> { self.0 .rename(&file_id, &new_parent_id, new_name) .map_err(|e| inventory_err_to_py_err(e, py)) } fn iter_sorted_children<'a>( &self, py: Python<'a>, file_id: FileId, ) -> PyResult>> { let children = self.0.iter_sorted_children(&file_id); if children.is_none() { return Err(NoSuchId::new_err((py.None(), file_id))); } children .unwrap() .map(|(_n, e)| Ok(entry_to_py(py, e.clone())?.into_any())) .collect::>>() } fn iter_all_ids<'a>(&self, py: Python<'a>) -> PyResult> { let ids = self.0.iter_all_ids(); ids.into_iter() .collect::>() .into_pyobject(py)? .call_method0("__iter__") } #[pyo3(signature = (from_dir=None, recursive=true))] fn iter_entries( slf: Py, py: Python, from_dir: Option, recursive: Option, ) -> PyResult> { let recursive = recursive.unwrap_or(true); Bound::new(py, IterEntriesIterator::new(py, slf, from_dir, recursive)?) 
} #[pyo3(signature = (from_dir=None, specific_file_ids=None))] fn iter_entries_by_dir( slf: Py, py: Python, from_dir: Option, specific_file_ids: Option>, ) -> PyResult> { Bound::new( py, IterEntriesByDirIterator::new(py, slf, from_dir, specific_file_ids)?, ) } fn change_root_id(&mut self, new_root_id: FileId) -> PyResult<()> { self.0.change_root_id(new_root_id); Ok(()) } fn copy(&self) -> Self { Self(self.0.clone()) } #[pyo3(signature = (kind, name, parent_id=None, file_id=None, revision=None, text_sha1=None, text_size=None, text_id=None, executable=None, symlink_target=None, reference_revision=None))] #[allow(clippy::too_many_arguments)] fn make_entry<'a>( &self, py: Python<'a>, kind: &str, name: &str, parent_id: Option, file_id: Option, revision: Option, text_sha1: Option>, text_size: Option, text_id: Option>, executable: Option, symlink_target: Option, reference_revision: Option, ) -> PyResult> { let kind = match kind { "directory" => Kind::Directory, "file" => Kind::File, "symlink" => Kind::Symlink, "tree-reference" => Kind::TreeReference, _ => return Err(PyValueError::new_err(format!("Unknown kind: {}", kind))), }; let entry = bazaar::inventory::make_entry( kind, name.to_string(), parent_id, file_id, revision, text_sha1, text_size, executable, text_id, symlink_target, reference_revision, ) .map_err(|e| inventory_err_to_py_err(e, py))?; entry_to_py(py, entry) } pub fn __richcmp__(&self, other: PyRef, op: CompareOp) -> PyResult { match op { CompareOp::Eq => Ok(self.0 == other.0), CompareOp::Ne => Ok(self.0 != other.0), _ => Err(PyNotImplementedError::new_err( "Only == and != are implemented", )), } } } #[pyclass] struct IterEntriesByDirIterator { inv: Py, parents: Option>, stack: Vec<(String, FileId)>, children: VecDeque<(String, Entry)>, specific_file_ids: Option>, } impl IterEntriesByDirIterator { fn new( py: Python, inv: Py, from_dir: Option, specific_file_ids: Option>, ) -> PyResult { let parents = specific_file_ids.as_ref().map(|specific_file_ids| { 
bazaar::inventory::find_interesting_parents( &inv.borrow(py).0, &specific_file_ids.iter().collect(), ) .into_iter() .cloned() .collect() }); let mut stack: Vec<(String, FileId)> = vec![]; let from_dir = if let Some(from_dir) = from_dir { let inv = &inv.borrow(py).0; let e = inv.get_entry(&from_dir); if e.is_none() { return Err(NoSuchId::new_err((py.None(), from_dir))); } let e = e.unwrap(); if e.kind() != Kind::Directory { return Err(NotADirectory::new_err(from_dir)); } Some(from_dir) } else { inv.borrow(py).0.root().map(|e| e.file_id().clone()) }; let mut children = VecDeque::new(); if let Some(from_dir) = from_dir.as_ref() { assert!( inv.borrow(py).0.get_children(from_dir).is_some(), "from_dir {:?} must be a directory", from_dir ); stack.push(("".to_string(), from_dir.clone())); if specific_file_ids.is_none() || specific_file_ids.as_ref().unwrap().contains(from_dir) { children.push_front(( "".to_string(), inv.borrow(py).0.get_entry(from_dir).unwrap().clone(), )); } } Ok(Self { inv, parents, children, stack, specific_file_ids, }) } } #[pymethods] impl IterEntriesByDirIterator { fn __iter__(slf: PyRef) -> PyResult> { Ok(slf.into()) } fn __next__<'py>(&mut self, py: Python<'py>) -> PyResult)>> { loop { if let Some((relpath, ie)) = self.children.pop_front() { return Ok(Some((relpath, entry_to_py(py, ie)?))); } if let Some((cur_relpath, cur_dir)) = self.stack.pop() { let mut child_dirs = Vec::new(); let inv = &self.inv.borrow(py).0; for (child_name, child_ie) in inv .iter_sorted_children(&cur_dir) .expect("should be known directory") { let child_relpath = cur_relpath.to_string() + child_name; if self.specific_file_ids.is_none() || self .specific_file_ids .as_ref() .unwrap() .contains(child_ie.file_id()) { self.children .push_back((child_relpath.clone(), child_ie.clone())); } if child_ie.kind() == Kind::Directory && (self.parents.is_none() || self.parents.as_ref().unwrap().contains(child_ie.file_id())) { assert!(self .inv .borrow(py) .0 
.get_children(child_ie.file_id()) .is_some()); child_dirs.push((child_relpath + "/", child_ie.file_id())) } } self.stack .extend(child_dirs.into_iter().rev().map(|(n, f)| (n, f.clone()))); } else { return Ok(None); } } } } #[pyclass] struct IterEntriesIterator { inv: Py, stack: VecDeque<(String, VecDeque<(String, Entry)>)>, recursive: bool, first_entry: Option, } impl IterEntriesIterator { fn new( py: Python<'_>, inv: Py, mut from_dir: Option, recursive: bool, ) -> PyResult { let mut stack = VecDeque::new(); let first_entry = if from_dir.is_none() { from_dir = inv.borrow(py).0.root().map(|e| e.file_id().clone()); inv.borrow(py).0.root().cloned() } else { None }; if let Some(from_dir) = from_dir.as_ref() { let inv = &inv.borrow(py).0; let children = inv.iter_sorted_children(from_dir); if children.is_none() { return Err(NoSuchId::new_err((py.None(), from_dir.clone()))); } stack.push_back(( String::new(), children .unwrap() .map(|(p, ie)| (p.to_string(), ie.clone())) .collect::>(), )); } Ok(Self { inv, stack, recursive, first_entry, }) } } #[pymethods] impl IterEntriesIterator { fn __iter__(slf: PyRef) -> PyResult> { Ok(slf.into()) } fn __next__<'py>(&mut self, py: Python<'py>) -> PyResult)>> { if let Some(first_entry) = self.first_entry.take() { return Ok(Some((String::new(), entry_to_py(py, first_entry)?))); } loop { if let Some((base, children)) = self.stack.back_mut() { if let Some((name, ie)) = children.pop_front() { let path = if base.is_empty() { name } else { format!("{}/{}", base, name) }; if ie.kind() == Kind::Directory && self.recursive { let children = self .inv .borrow(py) .0 .iter_sorted_children(ie.file_id()) .unwrap() .map(|(p, ie)| (p.to_string(), ie.clone())) .collect::>(); self.stack.push_back((path.clone(), children)); } return Ok(Some((path, entry_to_py(py, ie)?))); } else { self.stack.pop_back(); } } else { return Ok(None); } } } } #[pyfunction] #[pyo3(signature = (lines, allow_versioned_root=None, allow_tree_references=None))] fn 
parse_inventory_delta( py: Python, lines: Vec>, allow_versioned_root: Option, allow_tree_references: Option, ) -> PyResult<( Bound, Bound, bool, bool, Bound, )> { let (parent, version, versioned_root, tree_references, result) = bazaar::inventory_delta::parse_inventory_delta( lines .iter() .map(|x| x.as_slice()) .collect::>() .as_slice(), allow_versioned_root, allow_tree_references, ) .map_err(|e| match e { InventoryDeltaParseError::Invalid(m) => InventoryDeltaError::new_err((m,)), InventoryDeltaParseError::Incompatible(m) => IncompatibleInventoryDelta::new_err((m,)), })?; let result = Bound::new(py, InventoryDelta(result))?; Ok(( parent.into_pyobject(py)?, version.into_pyobject(py)?, versioned_root, tree_references, result, )) } #[pyfunction(signature = (file_id, name, parent_id, revision, lines))] fn parse_inventory_entry( file_id: FileId, name: String, parent_id: Option, revision: Option, lines: &[u8], ) -> InventoryEntry { InventoryEntry(bazaar::inventory_delta::parse_inventory_entry( file_id, name, parent_id, revision, lines, )) } #[pyfunction] fn serialize_inventory_entry<'a>( py: Python<'a>, entry: &'a InventoryEntry, ) -> PyResult> { Ok(PyBytes::new( py, bazaar::inventory_delta::serialize_inventory_entry(&entry.0) .map_err(|e| match e { InventoryDeltaSerializeError::Invalid(m) => InventoryDeltaError::new_err((m,)), InventoryDeltaSerializeError::UnsupportedKind(k) => PyKeyError::new_err((k,)), })? .as_slice(), )) } #[pyfunction] fn serialize_inventory_delta<'a>( py: Python<'a>, old_name: RevisionId, new_name: RevisionId, delta_to_new: &'a InventoryDelta, versioned_root: bool, tree_references: bool, ) -> PyResult>> { Ok(bazaar::inventory_delta::serialize_inventory_delta( &old_name, &new_name, &delta_to_new.0, versioned_root, tree_references, ) .map_err(|e| match e { InventoryDeltaSerializeError::Invalid(m) => InventoryDeltaError::new_err((m,)), InventoryDeltaSerializeError::UnsupportedKind(m) => PyKeyError::new_err((m,)), })? 
.into_iter() .map(|x| PyBytes::new(py, x.as_slice())) .collect()) } #[pyfunction] fn chk_inventory_entry_to_bytes<'a>( py: Python<'a>, entry: &'a InventoryEntry, ) -> PyResult> { Ok(PyBytes::new( py, bazaar::chk_inventory::chk_inventory_entry_to_bytes(&entry.0).as_slice(), )) } #[pyfunction] pub fn chk_inventory_bytes_to_entry<'py>( py: Python<'py>, data: &[u8], ) -> PyResult> { entry_to_py( py, bazaar::chk_inventory::chk_inventory_bytes_to_entry(data), ) } #[pyfunction] fn chk_inventory_bytes_to_utf8name_key<'py>( py: Python<'py>, data: &[u8], ) -> PyResult<(Bound<'py, PyBytes>, FileId, RevisionId)> { let (name, file_id, revision_id) = bazaar::chk_inventory::chk_inventory_bytes_to_utf8_name_key(data); Ok((PyBytes::new(py, name), file_id, revision_id)) } pub fn _inventory_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "inventory")?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_wrapped(wrap_pyfunction!(make_entry))?; m.add_wrapped(wrap_pyfunction!(is_valid_name))?; m.add_wrapped(wrap_pyfunction!(ensure_normalized_name))?; m.add_class::()?; m.add_class::()?; m.add_wrapped(wrap_pyfunction!(parse_inventory_delta))?; m.add_wrapped(wrap_pyfunction!(parse_inventory_entry))?; m.add_wrapped(wrap_pyfunction!(serialize_inventory_delta))?; m.add_wrapped(wrap_pyfunction!(serialize_inventory_entry))?; m.add("InventoryDeltaError", py.get_type::())?; m.add( "IncompatibleInventoryDelta", py.get_type::(), )?; m.add_wrapped(wrap_pyfunction!(chk_inventory_entry_to_bytes))?; m.add_wrapped(wrap_pyfunction!(chk_inventory_bytes_to_entry))?; m.add_wrapped(wrap_pyfunction!(chk_inventory_bytes_to_utf8name_key))?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar-py/src/lib.rs0000644000000000000000000004111015162074037016732 0ustar00use bazaar::RevisionId; use chrono::NaiveDateTime; use pyo3::class::basic::CompareOp; use pyo3::exceptions::{PyNotImplementedError, PyRuntimeError, PyTypeError, PyValueError}; use pyo3::import_exception; use 
pyo3::prelude::*;
use pyo3::types::{PyBytes, PyString};
use pyo3_filelike::PyBinaryFile;
use std::collections::HashMap;

mod chk_map;
mod dirstate;
mod groupcompress;
mod inventory;
mod smart;
mod versionedfile;

import_exception!(bzrformats.errors, ReservedId);

/// Create a new file id suffix that is reasonably unique.
///
/// On the first call we combine the current time with 64 bits of randomness to
/// give a highly probably globally unique number. Then each call in the same
/// process adds 1 to a serial number we append to that unique value.
#[pyfunction]
#[pyo3(signature = (suffix = None))]
fn _next_id_suffix<'py>(py: Python<'py>, suffix: Option<&str>) -> Bound<'py, PyBytes> {
    PyBytes::new(py, bazaar::gen_ids::next_id_suffix(suffix).as_slice())
}

/// Return new file id for the basename 'name'.
///
/// The uniqueness is supplied from _next_id_suffix.
#[pyfunction]
fn gen_file_id(name: &str) -> bazaar::FileId {
    bazaar::FileId::generate(name)
}

/// Return a new tree-root file id.
#[pyfunction]
fn gen_root_id() -> bazaar::FileId {
    bazaar::FileId::generate_root_id()
}

/// Return new revision-id.
///
/// Args:
///   username: The username of the committer, in the format returned by
///     config.username(). This is typically a real name, followed by an
///     email address. If found, we will use just the email address portion.
///     Otherwise we flatten the real name, and use that.
/// Returns: A new revision id.
#[pyfunction]
#[pyo3(signature = (username, timestamp = None))]
fn gen_revision_id(
    py: Python,
    username: &str,
    timestamp: Option<Py<PyAny>>,
) -> PyResult<bazaar::RevisionId> {
    let timestamp = match timestamp {
        Some(timestamp) => {
            if let Ok(timestamp) = timestamp.extract::<f64>(py) {
                Some(timestamp as u64)
            } else if let Ok(timestamp) = timestamp.extract::<u64>(py) {
                Some(timestamp)
            } else {
                return Err(PyTypeError::new_err(
                    "timestamp must be a float or an int".to_string(),
                ));
            }
        }
        None => None,
    };
    Ok(bazaar::RevisionId::generate(username, timestamp))
}

#[pyfunction]
fn normalize_pattern(pattern: &str) -> String {
    bazaar::globbing::normalize_pattern(pattern)
}

#[pyclass]
struct Replacer {
    replacer: bazaar::globbing::Replacer,
}

#[pymethods]
impl Replacer {
    #[new]
    #[pyo3(signature = (source = None))]
    fn new(source: Option<&Self>) -> Self {
        Self {
            replacer: bazaar::globbing::Replacer::new(source.map(|p| &p.replacer)),
        }
    }

    /// Add a pattern and replacement.
    ///
    /// The pattern must not contain capturing groups.
    /// The replacement may be either a string template in which \& will be
    /// replaced with the match, or a function that receives the matching text
    /// as its argument. It does not receive a match object, because capturing
    /// is forbidden anyway.
    fn add(&mut self, py: Python, pattern: &str, func: Py<PyAny>) -> PyResult<()> {
        if let Ok(func) = func.extract::<String>(py) {
            self.replacer
                .add(pattern, bazaar::globbing::Replacement::String(func));
            Ok(())
        } else {
            let callable = Box::new(move |t: String| -> String {
                Python::attach(|py| match func.call1(py, (t,)) {
                    Ok(result) => result.extract::<String>(py).unwrap(),
                    Err(e) => {
                        e.restore(py);
                        String::new()
                    }
                })
            });
            self.replacer
                .add(pattern, bazaar::globbing::Replacement::Closure(callable));
            Ok(())
        }
    }

    /// Add all patterns from another replacer.
    ///
    /// All patterns and replacements from replacer are appended to the ones
    /// already defined.
fn add_replacer(&mut self, replacer: &Self) { self.replacer.add_replacer(&replacer.replacer) } fn __call__(&mut self, py: Python, text: &str) -> PyResult { let ret = self .replacer .replace(text) .map_err(|e| PyValueError::new_err(e.to_string()))?; if PyErr::occurred(py) { Err(PyErr::fetch(py)) } else { Ok(ret) } } } #[pyclass(subclass)] struct Revision(bazaar::revision::Revision); /// Single revision on a branch. /// /// Revisions may know their revision_hash, but only once they've been /// written out. This is not stored because you cannot write the hash /// into the file it describes. /// /// Attributes: /// parent_ids: List of parent revision_ids /// /// properties: /// Dictionary of revision properties. These are attached to the /// revision as extra metadata. The name must be a single /// word; the value can be an arbitrary string. #[pymethods] impl Revision { #[new] #[pyo3(signature = (revision_id, parent_ids, committer, message, properties, inventory_sha1, timestamp, timezone))] fn new( py: Python, revision_id: RevisionId, parent_ids: Vec, committer: Option, message: String, properties: Option>>, inventory_sha1: Option>, timestamp: f64, timezone: Option, ) -> PyResult { let mut cproperties: HashMap> = HashMap::new(); for (k, v) in properties.unwrap_or_default() { if let Ok(s) = v.extract::>(py) { cproperties.insert(k, s.as_bytes().to_vec()); } else if let Ok(s) = v.extract::>(py) { let s = s .call_method1("encode", ("utf-8", "surrogateescape"))? 
.extract::>()?; cproperties.insert(k, s.as_bytes().to_vec()); } else { return Err(PyTypeError::new_err( "properties must be a dictionary of strings", )); } } if !bazaar::revision::validate_properties(&cproperties) { return Err(PyValueError::new_err( "properties must be a dictionary of strings", )); } Ok(Self(bazaar::revision::Revision { revision_id, parent_ids, committer, message, properties: cproperties, inventory_sha1, timestamp, timezone, })) } fn __richcmp__(&self, other: &Self, op: CompareOp) -> PyResult { match op { CompareOp::Eq => Ok(self.0 == other.0), CompareOp::Ne => Ok(self.0 != other.0), _ => Err(PyNotImplementedError::new_err( "only == and != are supported", )), } } fn __repr__(self_: PyRef) -> String { format!("", self_.0.revision_id) } #[getter] fn revision_id(&self) -> &bazaar::RevisionId { &self.0.revision_id } #[getter] fn parent_ids(&self) -> &Vec { &self.0.parent_ids } #[getter] fn committer(&self) -> Option { self.0.committer.clone() } #[getter] fn message(&self) -> String { self.0.message.clone() } #[getter] fn properties(&self) -> HashMap { self.0 .properties .iter() .map(|(k, v)| (k.clone(), String::from_utf8_lossy(v).into())) .collect() } #[getter] fn get_inventory_sha1<'py>(&self, py: Python<'py>) -> Bound<'py, PyAny> { if let Some(sha1) = &self.0.inventory_sha1 { PyBytes::new(py, sha1).into_any() } else { py.None().into_bound(py) } } #[setter] fn set_inventory_sha1(&mut self, py: Python, value: Py) -> PyResult<()> { if let Ok(value) = value.extract::>(py) { self.0.inventory_sha1 = Some(value.as_bytes().to_vec()); Ok(()) } else if value.is_none(py) { self.0.inventory_sha1 = None; Ok(()) } else { Err(PyTypeError::new_err("expected bytes or None")) } } #[getter] fn timestamp(&self) -> f64 { self.0.timestamp } #[getter] fn timezone(&self) -> Option { self.0.timezone } fn datetime(&self) -> NaiveDateTime { self.0.datetime() } fn check_properties(&self) -> PyResult<()> { if self.0.check_properties() { Ok(()) } else { 
Err(PyValueError::new_err("invalid properties")) } } fn get_summary(&self) -> String { self.0.get_summary() } fn get_apparent_authors(&self) -> Vec { self.0.get_apparent_authors() } fn bug_urls(&self) -> Vec { self.0.bug_urls() } } fn serializer_err_to_py_err(e: bazaar::serializer::Error) -> PyErr { PyRuntimeError::new_err(format!("serializer error: {:?}", e)) } #[pyclass(subclass)] struct RevisionSerializer(Box); #[pyclass(subclass,extends=RevisionSerializer)] struct BEncodeRevisionSerializerv1; #[pymethods] impl BEncodeRevisionSerializerv1 { #[new] fn new() -> (Self, RevisionSerializer) { ( Self {}, RevisionSerializer(Box::new( bazaar::bencode_serializer::BEncodeRevisionSerializer1, )), ) } } #[pyclass(subclass,extends=RevisionSerializer)] struct XMLRevisionSerializer8; #[pymethods] impl XMLRevisionSerializer8 { #[new] fn new() -> (Self, RevisionSerializer) { ( Self {}, RevisionSerializer(Box::new(bazaar::xml_serializer::XMLRevisionSerializer8)), ) } } #[pyclass(subclass,extends=RevisionSerializer)] struct XMLRevisionSerializer5; #[pymethods] impl XMLRevisionSerializer5 { #[new] fn new() -> (Self, RevisionSerializer) { ( Self {}, RevisionSerializer(Box::new(bazaar::xml_serializer::XMLRevisionSerializer5)), ) } } #[pymethods] impl RevisionSerializer { #[getter] fn format_name(&self) -> String { self.0.format_name().to_string() } #[getter] fn squashes_xml_invalid_characters(&self) -> bool { self.0.squashes_xml_invalid_characters() } fn read_revision(&self, py: Python, file: Py) -> PyResult { py.detach(|| { let mut file = PyBinaryFile::from(file); Ok(Revision( self.0 .read_revision(&mut file) .map_err(serializer_err_to_py_err)?, )) }) } fn write_revision_to_string<'py>( &self, py: Python<'py>, revision: &Revision, ) -> PyResult> { Ok(PyBytes::new( py, py.detach(|| self.0.write_revision_to_string(&revision.0)) .map_err(serializer_err_to_py_err)? 
                .as_slice(),
        ))
    }

    fn write_revision_to_lines<'a>(
        &self,
        py: Python<'a>,
        revision: &Revision,
    ) -> PyResult<Vec<Bound<'a, PyBytes>>> {
        self.0
            .write_revision_to_lines(&revision.0)
            .map(|s| -> PyResult<Bound<'a, PyBytes>> {
                Ok(PyBytes::new(
                    py,
                    s.map_err(serializer_err_to_py_err)?.as_slice(),
                ))
            })
            .collect::<PyResult<Vec<Bound<'a, PyBytes>>>>()
    }

    fn read_revision_from_string(&self, py: Python, string: &[u8]) -> PyResult<Revision> {
        Ok(Revision(
            py.detach(|| self.0.read_revision_from_string(string))
                .map_err(serializer_err_to_py_err)?,
        ))
    }
}

#[pyfunction(name = "is_null")]
fn is_null_revision(revision_id: RevisionId) -> bool {
    revision_id.is_null()
}

#[pyfunction(name = "is_reserved_id")]
fn is_reserved_revision_id(revision_id: RevisionId) -> bool {
    revision_id.is_reserved()
}

#[pyfunction(name = "check_not_reserved_id")]
fn check_not_reserved_id(_py: Python, revision_id: Bound<PyAny>) -> PyResult<()> {
    if revision_id.is_none() {
        return Ok(());
    }
    if let Ok(revision_id) = revision_id.extract::<RevisionId>() {
        if revision_id.is_reserved() {
            Err(ReservedId::new_err((revision_id,)))
        } else {
            Ok(())
        }
    } else {
        // For now, just ignore other types.
Ok(()) } } #[pyfunction] #[pyo3(signature = (message = None))] fn escape_invalid_chars(message: Option<&str>) -> (Option, usize) { if let Some(message) = message { ( Some(bazaar::xml_serializer::escape_invalid_chars(message)), message.len(), ) } else { (None, 0) } } #[pyfunction] fn encode_and_escape(py: Python, unicode_or_utf8_str: Py) -> PyResult> { let ret = if let Ok(text) = unicode_or_utf8_str.extract::(py) { bazaar::xml_serializer::encode_and_escape_string(&text) } else if let Ok(bytes) = unicode_or_utf8_str.extract::>(py) { bazaar::xml_serializer::encode_and_escape_bytes(&bytes) } else { return Err(PyTypeError::new_err("expected str or bytes")); }; Ok(PyBytes::new(py, ret.as_bytes())) } mod hashcache; mod rio; #[pymodule] fn _bzr_rs(py: Python, m: &Bound) -> PyResult<()> { m.add_wrapped(wrap_pyfunction!(_next_id_suffix))?; m.add_wrapped(wrap_pyfunction!(gen_file_id))?; m.add_wrapped(wrap_pyfunction!(gen_root_id))?; m.add_wrapped(wrap_pyfunction!(gen_revision_id))?; let m_globbing = PyModule::new(py, "globbing")?; m_globbing.add_wrapped(wrap_pyfunction!(normalize_pattern))?; m_globbing.add_class::()?; m.add_submodule(&m_globbing)?; m.add_class::()?; let inventorym = inventory::_inventory_rs(py)?; m.add_submodule(&inventorym)?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add( "revision_bencode_serializer", m.getattr("BEncodeRevisionSerializerv1")?.call0()?, )?; m.add( "revision_serializer_v8", m.getattr("XMLRevisionSerializer8")?.call0()?, )?; m.add( "revision_serializer_v5", m.getattr("XMLRevisionSerializer5")?.call0()?, )?; m.add("CURRENT_REVISION", bazaar::CURRENT_REVISION)?; m.add("NULL_REVISION", bazaar::NULL_REVISION)?; m.add("ROOT_ID", bazaar::inventory::ROOT_ID)?; m.add_wrapped(wrap_pyfunction!(is_null_revision))?; m.add_wrapped(wrap_pyfunction!(is_reserved_revision_id))?; m.add_wrapped(wrap_pyfunction!(check_not_reserved_id))?; m.add_wrapped(wrap_pyfunction!(escape_invalid_chars))?; 
m.add_wrapped(wrap_pyfunction!(encode_and_escape))?; let riom = PyModule::new(py, "rio")?; rio::rio(&riom)?; m.add_submodule(&riom)?; let hashcachem = PyModule::new(py, "hashcache")?; hashcache::hashcache(&hashcachem)?; m.add_submodule(&hashcachem)?; let dirstatem = dirstate::_dirstate_rs(py)?; m.add_submodule(&dirstatem)?; let groupcompressm = groupcompress::_groupcompress_rs(py)?; m.add_submodule(&groupcompressm)?; let chk_mapm = chk_map::_chk_map_rs(py)?; m.add_submodule(&chk_mapm)?; let smartm = smart::_smart_rs(py)?; m.add_submodule(&smartm)?; let versionedfilem = versionedfile::_versionedfile_rs(py)?; m.add_submodule(&versionedfilem)?; // PyO3 submodule hack for proper import support let sys = py.import("sys")?; let modules = sys.getattr("modules")?; let module_name = m.name()?; // Register submodules in sys.modules for dotted import support modules.set_item(format!("{}.globbing", module_name), &m_globbing)?; modules.set_item(format!("{}.inventory", module_name), &inventorym)?; modules.set_item(format!("{}.rio", module_name), &riom)?; modules.set_item(format!("{}.hashcache", module_name), &hashcachem)?; modules.set_item(format!("{}.dirstate", module_name), &dirstatem)?; modules.set_item(format!("{}.groupcompress", module_name), &groupcompressm)?; modules.set_item(format!("{}.chk_map", module_name), &chk_mapm)?; modules.set_item(format!("{}.smart", module_name), &smartm)?; modules.set_item(format!("{}.versionedfile", module_name), &versionedfilem)?; Ok(()) } bzrformats_3.4.0.orig/crates/bazaar-py/src/rio.rs0000644000000000000000000002652315162074037016770 0ustar00use pyo3::prelude::*; use pyo3::wrap_pyfunction; use pyo3::exceptions::{PyIOError, PyNotImplementedError, PyTypeError, PyValueError}; use pyo3::types::{PyBytes, PyDict, PyIterator, PyList, PyString, PyType}; use pyo3::class::basic::CompareOp; use std::io::BufReader; use pyo3_filelike::PyBinaryFile; #[pyfunction] fn valid_tag(tag: &str) -> bool { bazaar::rio::valid_tag(tag) } #[pyclass] #[derive(Clone, 
PartialEq)] struct Stanza { stanza: bazaar::rio::Stanza, } #[pymethods] impl Stanza { #[new] #[pyo3(signature = (**kwargs))] fn new(kwargs: Option<&Bound>) -> PyResult { let mut obj = Stanza { stanza: bazaar::rio::Stanza::new(), }; if let Some(kwargs) = kwargs { let items = kwargs.items(); items.sort()?; for item in items.iter() { let (key, value) = item.extract::<(String, Bound)>()?; obj.add(&key.to_string(), &value)?; } } Ok(obj) } fn __richcmp__(&self, other: &Bound, op: CompareOp) -> PyResult { match op { CompareOp::Eq => { let other_stanza = other.extract::(); if other_stanza.is_err() { Ok(false) } else { Ok(self.stanza.eq(&other_stanza.unwrap().stanza)) } } _ => Err(PyErr::new::("Not implemented")), } } fn __repr__(&self) -> PyResult { Ok(format!("{:?}", self.stanza)) } fn get<'py>(&self, tag: &str, py: Python<'py>) -> PyResult>> { if let Some(value) = self.stanza.get(tag) { match value { bazaar::rio::StanzaValue::String(v) => Ok(Some(PyString::new(py, v).into_any())), bazaar::rio::StanzaValue::Stanza(v) => Ok(Some( Bound::new(py, Stanza { stanza: *v.clone() })?.into_any(), )), } } else { Ok(None) } } /// Returns true if the stanza contains the given tag. fn __contains__(&self, tag: &str) -> PyResult { Ok(self.stanza.contains(tag)) } fn __len__(&self) -> PyResult { Ok(self.stanza.len()) } fn to_bytes<'a>(&self, py: Python<'a>) -> PyResult> { let ret: Bound = PyBytes::new(py, self.stanza.to_bytes().as_slice()); Ok(ret) } fn to_string<'a>(&self, py: Python<'a>) -> PyResult> { self.to_bytes(py) } fn to_lines(&self, py: Python) -> PyResult> { let ret = PyList::empty(py); for line in self.stanza.to_lines() { ret.append(PyBytes::new(py, line.as_bytes()))?; } Ok(ret.into()) } /// Add a tag and value to the stanza. fn add(&mut self, tag: &str, value: &Bound) -> PyResult<()> { if !valid_tag(tag) { return Err(PyErr::new::("Invalid tag")); } // If the type of value is PyString, then extract it as a String and add it to the stanza. 
// Otherwise, if the type of value is Stanza, then extract it as a Stanza and add it to the stanza. // Otherwise, return an error. let ret = if let Ok(val) = value.extract::() { self.stanza .add(tag.to_string(), bazaar::rio::StanzaValue::String(val)) } else if let Ok(val) = value.extract::() { self.stanza.add( tag.to_string(), bazaar::rio::StanzaValue::Stanza(Box::new(val.stanza)), ) } else { return Err(PyErr::new::(format!( "Invalid value: {}", value.repr()? ))); }; if let Err(e) = ret { if let bazaar::rio::Error::Io(e) = e { return Err(PyErr::new::(format!("IO error: {}", e))); } else { return Err(PyErr::new::(format!( "Invalid value: {}", value.repr()? ))); } } Ok(()) } /// Create a stanza from a list of pairs. #[classmethod] fn from_pairs(_cls: &Bound, pairs: Vec<(String, Bound)>) -> PyResult { let mut ret = Stanza::new(None)?; for (tag, value) in pairs { ret.add(tag.as_str(), &value)?; } Ok(ret) } // TODO: This is a hack to get around the fact that PyO3 doesn't support returning an iterator. fn iter_pairs<'a>(&self, py: Python<'a>) -> PyResult> { let ret = PyList::empty(py); for (tag, value) in self.stanza.iter_pairs() { match value { bazaar::rio::StanzaValue::String(v) => { ret.append((tag.to_string(), v.to_string()))? 
} bazaar::rio::StanzaValue::Stanza(v) => { let sub: Stanza = Stanza { stanza: *v.clone() }; ret.append((tag.to_string(), sub))?; } } } PyIterator::from_object(&ret) } fn as_dict(&self, py: Python) -> PyResult> { let ret = PyDict::new(py); for (tag, value) in self.stanza.iter_pairs() { match value { bazaar::rio::StanzaValue::String(v) => ret.set_item(tag, v.to_string())?, bazaar::rio::StanzaValue::Stanza(v) => { let sub: Stanza = Stanza { stanza: *v.clone() }; ret.set_item(tag, sub)?; } } } Ok(ret.into()) } fn get_all(&self, tag: &str, py: Python) -> PyResult> { let ret = PyList::empty(py); for value in self.stanza.get_all(tag) { match value { bazaar::rio::StanzaValue::String(v) => ret.append(v.to_string())?, bazaar::rio::StanzaValue::Stanza(v) => { let sub: Stanza = Stanza { stanza: *v.clone() }; ret.append(sub.into_pyobject(py)?)?; } } } Ok(ret.into()) } fn write(&self, file: Py) -> PyResult<()> { let mut writer = PyBinaryFile::from(file); self.stanza.write(&mut writer)?; Ok(()) } } #[pyclass] struct RioWriter { writer: bazaar::rio::RioWriter, } #[pymethods] impl RioWriter { #[new] fn new(file: Py) -> PyResult { let fw = PyBinaryFile::from(file); let writer = bazaar::rio::RioWriter::new(fw); Ok(RioWriter { writer }) } fn write_stanza(&mut self, stanza: &Stanza) -> PyResult<()> { self.writer.write_stanza(&stanza.stanza)?; Ok(()) } } #[pyfunction] fn read_stanza_file(file: Py) -> PyResult> { let reader = PyBinaryFile::from(file); let mut reader = BufReader::new(reader); let stanza = bazaar::rio::read_stanza_file(&mut reader).map_err(|e| match e { bazaar::rio::Error::Io(e) => { PyErr::new::(format!("Error reading stanza file: {}", e)) } _ => PyErr::new::("Error reading stanza file".to_string()), })?; if let Some(stanza) = stanza { Ok(Some(Stanza { stanza })) } else { Ok(None) } } #[pyfunction] fn read_stanza(file: &Bound) -> PyResult> { let mut py_iter = file.try_iter()?; let mut pyerr: Option = None; let line_iter = std::iter::from_fn(|| -> Option, 
bazaar::rio::Error>> { let line = py_iter.next()?; if let Err(e) = line { pyerr = Some(e); Some(Err(bazaar::rio::Error::Other("Python error".to_string()))) } else { let line = line.unwrap(); let line = line.extract::>(); if let Err(e) = line { pyerr = Some(e); Some(Err(bazaar::rio::Error::Other("invalid input".to_string()))) } else { Some(Ok(line.unwrap())) } } }); let stanza = bazaar::rio::read_stanza(line_iter).map_err(|e| { if let Some(e) = pyerr { return e; } match e { bazaar::rio::Error::Io(e) => { PyErr::new::(format!("Error reading stanza: {}", e)) } _ => PyErr::new::("Error reading stanza".to_string()), } })?; if let Some(stanza) = stanza { Ok(Some(Stanza { stanza })) } else { Ok(None) } } #[pyfunction] fn read_stanzas(file: Py) -> PyResult> { Python::attach(|py| { let reader = PyBinaryFile::from(file); let ret = PyList::empty(py); let mut reader = BufReader::new(reader); let stanzas = bazaar::rio::read_stanzas(&mut reader).map_err(|e| match e { bazaar::rio::Error::Io(e) => { PyErr::new::(format!("Error reading stanza file: {}", e)) } _ => PyErr::new::("Error reading stanza file: ".to_string()), })?; for stanza in stanzas { ret.append(Stanza { stanza })?; } Ok(ret.into()) }) } #[pyclass] struct RioReader { reader: bazaar::rio::RioReader>, } #[pymethods] impl RioReader { #[new] fn new(file: Py) -> PyResult { let reader = PyBinaryFile::from(file); let reader = BufReader::new(reader); let reader = bazaar::rio::RioReader::new(reader); Ok(RioReader { reader }) } fn __iter__<'a>(&mut self, py: Python<'a>) -> PyResult> { let ret = PyList::empty(py); for stanza in self.reader.iter() { let stanza = stanza.map_err(|e| match e { bazaar::rio::Error::Io(e) => { PyErr::new::(format!("Error reading stanza file: {}", e)) } _ => PyErr::new::("Error reading stanza file: ".to_string()), })?; ret.append(Stanza { stanza: stanza.unwrap(), })?; } PyIterator::from_object(&ret) } } #[pyfunction] #[pyo3(signature = (stanzas, header = None))] fn rio_iter<'a>( py: Python<'a>, stanzas: 
&'a Bound<'a, PyAny>, header: Option>, ) -> PyResult> { let ret = PyList::empty(py); let pyiter = stanzas.try_iter()?; let mut stanzas = Vec::new(); for stanza in pyiter { let stanza = stanza?; stanzas.push(stanza.extract::()?.stanza); } for line in bazaar::rio::rio_iter(stanzas.into_iter(), header) { let line = line.as_slice(); ret.append(PyBytes::new(py, line))?; } PyIterator::from_object(&ret) } pub(crate) fn rio(m: &Bound) -> PyResult<()> { m.add_wrapped(wrap_pyfunction!(valid_tag))?; m.add_wrapped(wrap_pyfunction!(read_stanza))?; m.add_wrapped(wrap_pyfunction!(read_stanza_file))?; m.add_wrapped(wrap_pyfunction!(read_stanzas))?; m.add_wrapped(wrap_pyfunction!(rio_iter))?; m.add_class::()?; m.add_class::()?; m.add_class::()?; Ok(()) } bzrformats_3.4.0.orig/crates/bazaar-py/src/smart.rs0000644000000000000000000000144315162074037017317 0ustar00use bazaar::smart::protocol::{ MESSAGE_VERSION_THREE, REQUEST_VERSION_THREE, REQUEST_VERSION_TWO, RESPONSE_VERSION_THREE, RESPONSE_VERSION_TWO, }; use pyo3::prelude::*; use pyo3::types::PyBytes; pub(crate) fn _smart_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "smart")?; m.add("REQUEST_VERSION_TWO", PyBytes::new(py, REQUEST_VERSION_TWO))?; m.add( "REQUEST_VERSION_THREE", PyBytes::new(py, REQUEST_VERSION_THREE), )?; m.add( "RESPONSE_VERSION_TWO", PyBytes::new(py, RESPONSE_VERSION_TWO), )?; m.add( "RESPONSE_VERSION_THREE", PyBytes::new(py, RESPONSE_VERSION_THREE), )?; m.add( "MESSAGE_VERSION_THREE", PyBytes::new(py, MESSAGE_VERSION_THREE), )?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar-py/src/versionedfile.rs0000644000000000000000000001120515162074037021024 0ustar00use bazaar::versionedfile::{ContentFactory, Key}; use pyo3::prelude::*; use pyo3::types::PyBytes; #[pyclass(subclass)] struct AbstractContentFactory(Box); pyo3::import_exception!(bzrformats.errors, UnavailableRepresentation); #[pymethods] impl AbstractContentFactory { #[getter] fn sha1(&self, py: Python) -> Option> { self.0.sha1().map(|x| 
PyBytes::new(py, &x).into()) } #[getter] fn key(&self) -> Key { self.0.key() } #[getter] fn parents(&self) -> Option> { self.0.parents() } #[getter] fn storage_kind(&self) -> String { self.0.storage_kind() } #[getter] fn size(&self) -> Option { self.0.size() } fn get_bytes_as(&self, py: Python, storage_kind: &str) -> PyResult> { if self.0.storage_kind() == "absent" { return Err(UnavailableRepresentation::new_err( "Absent content has no bytes".to_string(), )); } match storage_kind { "fulltext" => Ok(PyBytes::new(py, self.0.to_fulltext().as_ref()).into()), "lines" => Ok(self .0 .to_lines() .map(|b| PyBytes::new(py, b.as_ref())) .map(|b| b.unbind().into()) .collect::>>() .into_pyobject(py)? .unbind()), "chunked" => Ok(self .0 .to_chunks() .map(|b| PyBytes::new(py, b.as_ref())) .map(|b| b.unbind().into()) .collect::>>() .into_pyobject(py)? .unbind()), _ => Err(UnavailableRepresentation::new_err(format!( "Unsupported storage kind: {}", storage_kind ))), } } fn map_key(&mut self, py: Python, cb: Py) -> PyResult<()> { self.0 .map_key(&|k| cb.call1(py, (k,)).unwrap().extract::(py).unwrap()); Ok(()) } } #[pyclass(extends=AbstractContentFactory)] struct FulltextContentFactory; #[pymethods] impl FulltextContentFactory { #[new] #[pyo3(signature = (key, parents, sha1, text))] fn new( key: Key, parents: Option>, sha1: Option>, text: Vec, ) -> PyResult<(Self, AbstractContentFactory)> { let of = bazaar::versionedfile::FulltextContentFactory::new(sha1, key, parents, text); Ok((FulltextContentFactory, AbstractContentFactory(Box::new(of)))) } } #[pyclass(extends=AbstractContentFactory)] struct ChunkedContentFactory; #[pymethods] impl ChunkedContentFactory { #[new] #[pyo3(signature = (key, parents, sha1, chunks))] fn new( key: Key, parents: Option>, sha1: Option>, chunks: Vec>, ) -> PyResult<(Self, AbstractContentFactory)> { let of = bazaar::versionedfile::ChunkedContentFactory::new(sha1, key, parents, chunks); Ok((ChunkedContentFactory, AbstractContentFactory(Box::new(of)))) } } 
#[pyfunction] pub fn record_to_fulltext_bytes(py: Python, record: Py) -> PyResult> { let record = record.extract::(py)?; let mut s = Vec::new(); bazaar::versionedfile::record_to_fulltext_bytes(record, &mut s)?; Ok(PyBytes::new(py, &s).into()) } #[pyclass(extends=AbstractContentFactory)] struct AbsentContentFactory; #[pymethods] impl AbsentContentFactory { #[new] fn new(key: Key) -> PyResult<(Self, AbstractContentFactory)> { let of = bazaar::versionedfile::AbsentContentFactory::new(key); Ok((AbsentContentFactory, AbstractContentFactory(Box::new(of)))) } } #[pyfunction] fn fulltext_network_to_record<'a>( py: Python<'a>, _kind: &'a str, bytes: &'a [u8], line_end: usize, ) -> Vec> { let record = bazaar::versionedfile::fulltext_network_to_record(bytes, line_end); let sub = PyClassInitializer::from(AbstractContentFactory(Box::new(record))) .add_subclass(FulltextContentFactory); vec![Bound::new(py, sub).unwrap()] } pub(crate) fn _versionedfile_rs(py: Python) -> PyResult> { let m = PyModule::new(py, "versionedfile")?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_function(wrap_pyfunction!(record_to_fulltext_bytes, &m)?)?; m.add_function(wrap_pyfunction!(fulltext_network_to_record, &m)?)?; Ok(m) } bzrformats_3.4.0.orig/crates/bazaar/Cargo.toml0000644000000000000000000000160415162074037016335 0ustar00[package] name = "bazaar" version = { workspace = true } authors = [ "Martin Packman ", "Jelmer Vernooij "] edition = "2018" description = "Rust implementation of the Bazaar formats and protocols" license = "GPL-2.0+" homepage = "https://www.breezy-vcs.org/" repository = "https://github.com/breezy-team/bazaar-rs" [lib] [dependencies] osutils = { path = "../osutils-rs" } lazy_static = "1.4.0" regex = "1.3.1" fancy-regex = ">=0.7" chrono = { workspace = true } bendy = "0.3" xmltree = "0.11" sha1 = "0.10" tempfile = "3" log = "0.4" pyo3 = { version = ">=0.17", optional = true } crc32fast = "1.2.0" base64 = "0.22.1" maplit = "1.0.2" lazy-regex = 
"3.4.0" byteorder = "1.5.0" lru = "0.13.0" flate2 = "1.0.28" xz2 = "0.1.7" [target.'cfg(unix)'.dependencies] nix = { workspace = true, features = ["fs"] } [features] default = ["pyo3"] pyo3 = ["dep:pyo3"] bzrformats_3.4.0.orig/crates/bazaar/README.md0000644000000000000000000000021715162074037015663 0ustar00This crate contains a rust implementation of the [Bazaar](https://www.bazaar-vcs.org/) file formats and protocols. It's currently incomplete. bzrformats_3.4.0.orig/crates/bazaar/src/0000755000000000000000000000000015162074037015173 5ustar00bzrformats_3.4.0.orig/crates/bazaar/src/bencode_serializer.rs0000644000000000000000000003113215162074037021371 0ustar00use crate::revision::Revision; use crate::serializer::{Error, RevisionSerializer}; use crate::RevisionId; use bendy::decoding::Object; use bendy::encoding::Encoder; use std::io::BufRead; use std::io::Read; pub struct BEncodeRevisionSerializer1; impl RevisionSerializer for BEncodeRevisionSerializer1 { fn format_name(&self) -> &'static str { "10" } fn squashes_xml_invalid_characters(&self) -> bool { false } fn write_revision_to_string(&self, rev: &Revision) -> std::result::Result, Error> { let mut e = Encoder::new(); e.emit_list(|e| { e.emit_list(|e| { e.emit_bytes(b"format")?; e.emit_int(10)?; Ok(()) })?; if let Some(committer) = rev.committer.as_ref() { e.emit_list(|e| { e.emit_bytes(b"committer")?; e.emit_bytes(committer.as_bytes())?; Ok(()) })?; } if let Some(timezone) = rev.timezone { e.emit_list(|e| { e.emit_bytes(b"timezone")?; e.emit_int(timezone)?; Ok(()) })?; } e.emit_list(|e| { e.emit_bytes(b"properties")?; e.emit_dict(|mut e| { let mut keys = rev.properties.keys().collect::>(); keys.sort_by_key(|k| k.as_bytes()); for k in keys { let v = rev.properties.get(k).unwrap(); e.emit_pair_with(k.as_bytes(), |e| { e.emit_bytes(v)?; Ok(()) })?; } Ok(()) })?; Ok(()) })?; e.emit_list(|e| { e.emit_bytes(b"timestamp")?; e.emit_bytes(format!("{:.3}", rev.timestamp).as_bytes())?; Ok(()) })?; e.emit_list(|e| { 
e.emit_bytes(b"revision-id")?; e.emit_bytes(rev.revision_id.0.as_slice())?; Ok(()) })?; e.emit_list(|e| { e.emit_bytes(b"parent-ids")?; e.emit_list(|e| { for p in rev.parent_ids.iter() { e.emit_bytes(p.0.as_slice())?; } Ok(()) })?; Ok(()) })?; if let Some(inventory_sha1) = rev.inventory_sha1.as_ref() { e.emit_list(|e| { e.emit_bytes(b"inventory-sha1")?; e.emit_bytes(inventory_sha1.as_slice())?; Ok(()) })?; } e.emit_list(|e| { e.emit_bytes(b"message")?; e.emit_bytes(rev.message.as_bytes())?; Ok(()) })?; Ok(()) }) .map_err(|e| Error::EncodeError(format!("failed to encode revision: {}", e)))?; e.get_output() .map_err(|e| Error::EncodeError(format!("failed to encode revision: {}", e))) } fn write_revision_to_lines( &self, rev: &Revision, ) -> Box, Error>>> { let buf = self.write_revision_to_string(rev); if let Err(e) = buf { return Box::new(std::iter::once(Err(e))); } let buf = buf.unwrap(); let mut cursor = std::io::Cursor::new(buf); Box::new(std::iter::from_fn(move || { let mut buf = Vec::new(); if let Err(e) = cursor.read_until(b'\n', &mut buf) { return Some(Err(Error::EncodeError(format!( "failed to encode revision: {}", e )))); } if buf.is_empty() { None } else { Some(Ok(buf)) } })) } fn read_revision_from_string(&self, text: &[u8]) -> std::result::Result { let mut decoder = bendy::decoding::Decoder::new(text); let mut d = if let Some(Object::List(d)) = decoder .next_object() .map_err(|e| Error::DecodeError(format!("failed to decode bencode: {}", e)))? { d } else { return Err(Error::DecodeError("expected dict".to_string())); }; let mut timestamp = None; let mut timezone = None; let mut committer = None; let mut properties = None; let mut message = None; let mut parent_ids = None; let mut revision_id = None; let mut inventory_sha1 = None; while let Some(entry) = d .next_object() .map_err(|e| Error::DecodeError(format!("failed to decode bencode: {}", e)))? 
{ let mut tuple = entry.list_or_else(|_| Err(Error::DecodeError("expected tuple".to_string())))?; let key = tuple .next_object() .map_err(|e| Error::DecodeError(format!("expected tuple with key: {}", e)))? .ok_or_else(|| Error::DecodeError("expected tuple with key".to_string()))? .bytes_or_else(|_| { Err(Error::DecodeError("expected tuple with key".to_string())) })?; let value = tuple .next_object() .map_err(|e| Error::DecodeError(format!("expected tuple with value: {}", e)))? .ok_or_else(|| Error::DecodeError("expected tuple with value".to_string()))?; match key { b"format" => { if value .integer_or(Err(Error::DecodeError("invalid format".to_string())))? .parse::() .map_err(|e| Error::DecodeError(format!("invalid format: {}", e)))? != 10 { return Err(Error::DecodeError("invalid format".to_string())); } } b"timezone" => { timezone = Some( value .integer_or(Err(Error::DecodeError("invalid timezone".to_string())))? .parse() .map_err(|e| Error::DecodeError(format!("invalid timezone: {}", e)))?, ); } b"timestamp" => { timestamp = Some( String::from_utf8( value .bytes_or(Err(Error::DecodeError("invalid timestamp".to_string())))? .to_vec(), ) .map_err(|e| Error::DecodeError(format!("invalid timestamp: {}", e)))? .parse::() .map_err(|e| Error::DecodeError(format!("invalid timestamp: {}", e)))?, ); } b"committer" => { committer = Some( String::from_utf8( value .bytes_or(Err(Error::DecodeError("invalid committer".to_string())))? .to_vec(), ) .map_err(|e| Error::DecodeError(format!("invalid committer: {}", e)))?, ); } b"parent-ids" => { let mut ps = value.list_or(Err(Error::DecodeError("invalid parent_ids".to_string())))?; let mut gs = Vec::new(); while let Some(o) = ps.next_object().map_err(|e| { Error::DecodeError(format!("failed to decode bencode: {}", e)) })? 
{ let p = RevisionId::from( o.bytes_or(Err(Error::DecodeError("invalid parent_id".to_string())))?, ); gs.push(p); } parent_ids = Some(gs); } b"revision-id" => { revision_id = Some(RevisionId::from( value .bytes_or(Err(Error::DecodeError("invalid revision_id".to_string())))?, )); } b"inventory-sha1" => { inventory_sha1 = Some( value .bytes_or(Err(Error::DecodeError( "invalid inventory_sha1".to_string(), )))? .to_vec(), ); } b"properties" => { properties = Some( value .dictionary_or_else(|_| { Err(Error::DecodeError("invalid properties".to_string())) }) .map(|mut d| { let mut ps = std::collections::HashMap::new(); while let Some((k, v)) = d.next_pair().map_err(|e| { Error::DecodeError(format!( "failed to decode bencode: {}", e )) })? { let v = v .bytes_or(Err(Error::DecodeError(format!( "invalid property {}", String::from_utf8_lossy(k) ))))? .to_vec(); let k = String::from_utf8(k.to_vec()).map_err(|e| { Error::DecodeError(format!( "invalid property {}: {}", String::from_utf8_lossy(k), e )) })?; ps.insert(k, v); } Ok::< std::collections::HashMap>, Error, >(ps) })??, ); } b"message" => { message = Some( String::from_utf8( value .bytes_or(Err(Error::DecodeError("invalid message".to_string())))? .to_vec(), ) .map_err(|e| Error::DecodeError(format!("invalid message: {}", e)))?, ); } _ => { return Err(Error::DecodeError(format!( "unknown key {}", String::from_utf8_lossy(key) ))); } } if tuple .next_object() .map_err(|e| Error::DecodeError(format!("expected tuple: {}", e)))? 
.is_some() { return Err(Error::DecodeError("extra item in tuple".to_string())); } } Ok(Revision::new( revision_id.ok_or(Error::DecodeError("missing revision_id".to_string()))?, parent_ids.ok_or(Error::DecodeError("missing parent_ids".to_string()))?, committer, message.ok_or(Error::DecodeError("missing message".to_string()))?, properties.ok_or(Error::DecodeError("missing properties".to_string()))?, inventory_sha1, timestamp.ok_or(Error::DecodeError("missing timestamp".to_string()))?, timezone, )) } fn read_revision(&self, f: &mut dyn Read) -> std::result::Result { let mut buf = Vec::new(); f.read_to_end(&mut buf).map_err(Error::IOError)?; self.read_revision_from_string(&buf) } } #[allow(dead_code)] const BENCODE_REVISION_SERIALIZER_V1: BEncodeRevisionSerializer1 = BEncodeRevisionSerializer1 {}; bzrformats_3.4.0.orig/crates/bazaar/src/chk_inventory.rs0000644000000000000000000001251515162074037020427 0ustar00use crate::inventory::Entry; /// Serialise entry as a single bytestring. /// /// :param Entry: An inventory entry. /// :return: A bytestring for the entry. /// /// The BNF: /// ENTRY ::= FILE | DIR | SYMLINK | TREE /// FILE ::= "file: " COMMON SEP SHA SEP SIZE SEP EXECUTABLE /// DIR ::= "dir: " COMMON /// SYMLINK ::= "symlink: " COMMON SEP TARGET_UTF8 /// TREE ::= "tree: " COMMON REFERENCE_REVISION /// COMMON ::= FILE_ID SEP PARENT_ID SEP NAME_UTF8 SEP REVISION /// SEP ::= "\n" pub fn chk_inventory_entry_to_bytes(entry: &Entry) -> Vec { let ts; let (header, mut lines) = match entry { Entry::File { name, executable, revision, text_sha1, text_size, parent_id, .. } => { ts = format!("{}", text_size.expect("no text size set")); ( &b"file"[..], vec![ parent_id.as_bytes(), name.as_bytes(), revision.as_ref().expect("no revision set").as_bytes(), text_sha1.as_ref().expect("no text sha1 set").as_slice(), ts.as_bytes(), if *executable { b"Y" } else { b"N" }, ], ) } Entry::Directory { revision, name, parent_id, .. 
} => ( &b"dir"[..], vec![ parent_id.as_bytes(), name.as_bytes(), revision.as_ref().expect("no revision set").as_bytes(), ], ), Entry::Root { revision, .. } => ( &b"dir"[..], vec![ &b""[..], &b""[..], revision.as_ref().expect("no revision set").as_bytes(), ], ), Entry::Link { name, revision, symlink_target, parent_id, .. } => ( &b"symlink"[..], vec![ parent_id.as_bytes(), name.as_bytes(), revision.as_ref().expect("no revision set").as_bytes(), symlink_target .as_ref() .expect("no symlink target set") .as_bytes(), ], ), Entry::TreeReference { revision, name, reference_revision, parent_id, .. } => ( &b"tree"[..], vec![ parent_id.as_bytes(), name.as_bytes(), revision.as_ref().expect("no revision set").as_bytes(), reference_revision .as_ref() .expect("no reference revision set") .as_bytes(), ], ), }; let header = [header, b": ", entry.file_id().as_bytes()].concat(); lines.insert(0, header.as_slice()); lines.join(&b"\n"[..]) } pub fn chk_inventory_bytes_to_entry(data: &[u8]) -> Entry { let sections = data.split(|&c| c == b'\n').collect::>(); let sp: Vec<&[u8]> = sections[0].splitn(2, |&c| c == b':').collect(); assert!(&sp[1][..1] == b" "); let kind = sp[0]; let file_id = crate::FileId::from(&sp[1][1..]); let name = String::from_utf8(sections[2].to_vec()).unwrap(); let parent_id = if sections[1].is_empty() { None } else { Some(crate::FileId::from(sections[1])) }; let revision = Some(crate::RevisionId::from(sections[3])); match String::from_utf8(kind.to_vec()).unwrap().as_str() { "file" => Entry::File { name, file_id, parent_id: parent_id.unwrap(), text_sha1: Some(sections[4].to_vec()), text_size: Some( String::from_utf8(sections[5].to_vec()) .unwrap() .parse() .unwrap(), ), executable: sections[6] == b"Y", revision, text_id: None, }, "dir" => { if let Some(parent_id) = parent_id { Entry::Directory { name, file_id, parent_id, revision, } } else { Entry::Root { file_id, revision } } } "symlink" => Entry::Link { name, file_id, parent_id: parent_id.unwrap(), symlink_target: 
            Some(String::from_utf8(sections[4].to_vec()).unwrap()),
            revision,
        },
        "tree" => Entry::TreeReference {
            name,
            file_id,
            parent_id: parent_id.unwrap(),
            reference_revision: Some(crate::RevisionId::from(sections[4])),
            revision,
        },
        _ => {
            panic!("Invalid inventory entry");
        }
    }
}

pub fn chk_inventory_bytes_to_utf8_name_key(
    data: &[u8],
) -> (&[u8], crate::FileId, crate::RevisionId) {
    let sections = data.split(|&c| c == b'\n').collect::<Vec<_>>();
    let sp: Vec<&[u8]> = sections[0].splitn(2, |&c| c == b':').collect();
    assert!(&sp[1][..1] == b" ");
    let file_id = crate::FileId::from(&sp[1][1..]);
    let revision = crate::RevisionId::from(sections[3]);
    (sections[2], file_id, revision)
}
bzrformats_3.4.0.orig/crates/bazaar/src/chk_map.rs0000644000000000000000000001415315162074037017147 0ustar00
//! Persistent maps from tuple_of_strings->string using CHK stores.
//!
//! Overview and current status:
//!
//! The CHKMap class implements a dict from tuple_of_strings->string by using a trie
//! with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
//! joining them by \x00, and \x00 padding shorter keys out to the length of the
//! longest key. Leaf nodes are packed as densely as possible, and internal nodes
//! are all an additional 8-bits wide leading to a sparse upper tree.
//!
//! Updates to a CHKMap are done preferentially via the apply_delta method, to
//! allow optimisation of the update operation; but individual map/unmap calls are
//! possible and supported. Individual changes via map/unmap are buffered in memory
//! until the _save method is called to force serialisation of the tree.
//! apply_delta records its changes immediately by performing an implicit _save.
//!
//! # Todo
//!
//! Densely packed upper nodes.
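//!
//! # Example
//!
//! A minimal, self-contained sketch of the key mapping described in the
//! overview: the parts of a key tuple are joined with `\x00` to form a single
//! search string. `join_key_parts` is a hypothetical helper written only for
//! this illustration (it is not part of this crate's API), and the sketch
//! leaves out the `\x00` padding of shorter keys.
//!
//! ```rust
//! // Hypothetical helper: join the parts of a key tuple with a NUL byte.
//! fn join_key_parts(parts: &[&[u8]]) -> Vec<u8> {
//!     let mut out = Vec::new();
//!     for (i, part) in parts.iter().enumerate() {
//!         if i > 0 {
//!             out.push(0x00); // separator between consecutive parts
//!         }
//!         out.extend_from_slice(part);
//!     }
//!     out
//! }
//!
//! let search_key = join_key_parts(&[&b"file-id"[..], &b"rev-1"[..]]);
//! assert_eq!(search_key, b"file-id\x00rev-1".to_vec());
//! ```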
use crc32fast::Hasher;
use std::fmt::Write;
use std::hash::Hash;
use std::iter::zip;

fn crc32(bit: &[u8]) -> u32 {
    let mut hasher = Hasher::new();
    hasher.update(bit);
    hasher.finalize()
}

pub type SerialisedKey = Vec<u8>;

pub type SearchKeyFn = fn(&Key) -> SerializedKey;

/// Map the key tuple into a search string that just uses the key bytes.
pub fn search_key_plain(key: &Key) -> SerializedKey {
    key.0.join(&b'\x00')
}

/// Map the key tuple into a search string of uppercase hexadecimal CRC32
/// values, one per key element, joined by \x00.
pub fn search_key_16(key: &Key) -> SerializedKey {
    let mut result = String::new();
    for bit in key.iter() {
        write!(&mut result, "{:08X}\x00", crc32(bit)).unwrap();
    }
    result.pop();
    result.as_bytes().to_vec()
}

/// Map the key tuple into a search string of big-endian CRC32 bytes, one
/// group per key element, joined by \x00. Newline bytes are replaced by '_'.
pub fn search_key_255(key: &Key) -> SerializedKey {
    let mut result = vec![];
    for bit in key.iter() {
        let crc = crc32(bit);
        let crc_bytes = crc.to_be_bytes();
        result.extend(crc_bytes);
        result.push(0x00);
    }
    result.pop();
    result
        .iter()
        .map(|b| if *b == 0x0A { b'_' } else { *b })
        .collect()
}

pub fn bytes_to_text_key(data: &[u8]) -> Result<(&[u8], &[u8]), String> {
    let sections: Vec<&[u8]> = data.split(|&byte| byte == b'\n').collect();

    let delimiter_position = sections[0].windows(2).position(|window| window == b": ");

    if delimiter_position.is_none() {
        return Err("Invalid key file".to_string());
    }

    let (_kind, file_id) = sections[0].split_at(delimiter_position.unwrap() + 2);

    Ok((file_id, sections[3]))
}

#[derive(Debug, Hash, PartialEq, Eq, Clone)]
pub struct Key(Vec<Vec<u8>>);

impl From<Vec<Vec<u8>>> for Key {
    fn from(v: Vec<Vec<u8>>) -> Self {
        Key(v)
    }
}

impl Key {
    pub fn serialize(&self) -> SerializedKey {
        let mut result = vec![];
        for bit in self.0.iter() {
            result.extend(bit);
            result.push(0x00);
        }
        result.pop();
        result
    }

    #[allow(clippy::len_without_is_empty)]
    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn iter(&self) -> impl Iterator<Item = &[u8]> {
        self.0.iter().map(|v| v.as_slice())
    }
}

impl std::ops::Index<usize> for Key {
    type Output = Vec<u8>;

    fn index(&self, index: usize) -> &Self::Output {
        &self.0[index]
    }
}

impl std::fmt::Display for Key {
    fn fmt(&self, f: &mut std::fmt::Formatter) ->
        std::fmt::Result {
        let mut first = true;
        for bit in &self.0 {
            if !first {
                write!(f, "/")?;
            }
            first = false;
            write!(f, "{}", String::from_utf8_lossy(bit))?;
        }
        Ok(())
    }
}

pub type SerializedKey = Vec<u8>;

pub type Value = Vec<u8>;

pub enum Error {
    InconsistentDeltaDelta(Vec<(Option<Key>, Option<Key>, Value)>, String),
    DeserializeError(String),
}

impl From<std::num::ParseIntError> for Error {
    fn from(e: std::num::ParseIntError) -> Self {
        Error::DeserializeError(format!("Failed to parse int: {}", e))
    }
}

/// Given 2 strings, return the longest prefix common to both.
///
/// # Arguments
/// * `prefix` - This has been the common prefix for other keys, so it is more likely to be the common prefix in this case as well.
/// * `key` - Another string to compare to.
pub fn common_prefix_pair<'b>(prefix: &[u8], key: &'b [u8]) -> &'b [u8] {
    if key.starts_with(prefix) {
        return &key[..prefix.len()];
    }
    let mut p = 0;
    // Is there a better way to do this?
    for (left, right) in zip(prefix, key) {
        if left != right {
            break;
        }
        p += 1;
    }
    let p = p as usize;
    &key[..p]
}

#[test]
fn test_common_prefix_pair() {
    assert_eq!(common_prefix_pair(b"abc", b"abc"), b"abc");
    assert_eq!(common_prefix_pair(b"abc", b"abcd"), b"abc");
    assert_eq!(common_prefix_pair(b"abc", b"ab"), b"ab");
    assert_eq!(common_prefix_pair(b"abc", b"bbd"), b"");
    assert_eq!(common_prefix_pair(b"", b"bbc"), b"");
    assert_eq!(common_prefix_pair(b"abc", b""), b"");
}

/// Given a list of keys, find their common prefix.
///
/// # Arguments
/// * `keys`: An iterable of strings.
///
/// # Returns
/// The longest common prefix of all keys.
pub fn common_prefix_many<'a>(mut keys: impl Iterator<Item = &'a [u8]> + 'a) -> Option<&'a [u8]> {
    let mut cp = keys.next()?;
    for key in keys {
        cp = common_prefix_pair(cp, key);
        if cp.is_empty() {
            // if common_prefix is the empty string, then we know it won't
            // change further
            break;
        }
    }
    Some(cp)
}

#[test]
fn test_common_prefix_many() {
    assert_eq!(
        common_prefix_many(vec![&b"abc"[..], &b"abc"[..]].into_iter()),
        Some(&b"abc"[..])
    );
    assert_eq!(
        common_prefix_many(vec![&b"abc"[..], &b"abcd"[..]].into_iter()),
        Some(&b"abc"[..])
    );
    assert_eq!(
        common_prefix_many(vec![&b"abc"[..], &b"ab"[..]].into_iter()),
        Some(&b"ab"[..])
    );
    assert_eq!(
        common_prefix_many(vec![&b"abc"[..], &b"bbd"[..]].into_iter()),
        Some(&b""[..])
    );
    assert_eq!(
        common_prefix_many(vec![&b"abcd"[..], &b"abc"[..], &b"abc"[..]].into_iter()),
        Some(&b"abc"[..])
    );
    assert_eq!(common_prefix_many(vec![].into_iter()), None);
}
bzrformats_3.4.0.orig/crates/bazaar/src/dirstate.rs0000644000000000000000000003244615162210136017361 0ustar00
use crate::inventory::Entry as InventoryEntry;
use crate::FileId;
use base64::engine::Engine;
use osutils::sha::{sha_file, sha_file_by_name};
use std::cmp::Ordering;
use std::collections::HashMap;
use std::fs::File;
use std::fs::Metadata;
#[cfg(unix)]
use std::os::unix::fs::MetadataExt;
use std::path::Path;

pub trait SHA1Provider: Send + Sync {
    fn sha1(&self, path: &Path) -> std::io::Result<String>;

    fn stat_and_sha1(&self, path: &Path) -> std::io::Result<(Metadata, String)>;
}

/// A SHA1Provider that reads directly from the filesystem.
pub struct DefaultSHA1Provider;

impl DefaultSHA1Provider {
    pub fn new() -> DefaultSHA1Provider {
        DefaultSHA1Provider {}
    }
}

impl Default for DefaultSHA1Provider {
    fn default() -> Self {
        Self::new()
    }
}

impl SHA1Provider for DefaultSHA1Provider {
    /// Return the sha1 of a file given its absolute path.
    fn sha1(&self, path: &Path) -> std::io::Result<String> {
        sha_file_by_name(path)
    }

    /// Return the stat and sha1 of a file given its absolute path.
    fn stat_and_sha1(&self, path: &Path) -> std::io::Result<(Metadata, String)> {
        let mut f = File::open(path)?;
        let stat = f.metadata()?;
        let sha1 = sha_file(&mut f)?;
        Ok((stat, sha1))
    }
}

pub fn lt_by_dirs(path1: &Path, path2: &Path) -> bool {
    let path1_parts = path1.components();
    let path2_parts = path2.components();
    let mut path1_parts_iter = path1_parts;
    let mut path2_parts_iter = path2_parts;

    loop {
        match (path1_parts_iter.next(), path2_parts_iter.next()) {
            (None, None) => return false,
            (None, Some(_)) => return true,
            (Some(_), None) => return false,
            (Some(part1), Some(part2)) => match part1.cmp(&part2) {
                Ordering::Equal => continue,
                Ordering::Less => return true,
                Ordering::Greater => return false,
            },
        }
    }
}

pub fn lt_path_by_dirblock(path1: &Path, path2: &Path) -> bool {
    let key1 = (path1.parent(), path1.file_name());
    let key2 = (path2.parent(), path2.file_name());

    key1 < key2
}

pub fn bisect_path_left(paths: &[&Path], path: &Path) -> usize {
    let mut hi = paths.len();
    let mut lo = 0;
    while lo < hi {
        let mid = (lo + hi) / 2;
        // Grab the dirname for the current dirblock
        let cur = paths[mid];
        if lt_path_by_dirblock(cur, path) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

pub fn bisect_path_right(paths: &[&Path], path: &Path) -> usize {
    let mut hi = paths.len();
    let mut lo = 0;
    while lo < hi {
        let mid = (lo + hi) / 2;
        // Grab the dirname for the current dirblock
        let cur = paths[mid];
        if lt_path_by_dirblock(path, cur) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    lo
}

#[cfg(unix)]
pub fn pack_stat_metadata(metadata: &Metadata) -> String {
    pack_stat(
        metadata.len(),
        metadata
            .modified()
            .unwrap()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
        metadata
            .created()
            .unwrap()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
        metadata.dev(),
        metadata.ino(),
        metadata.mode(),
    )
}

#[cfg(windows)]
pub fn pack_stat_metadata(metadata: &Metadata) -> String {
    pack_stat(
        metadata.len(),
        metadata
            .modified()
            .unwrap()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
        metadata
            .created()
            .unwrap()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
        0,
        0,
        0,
    )
}

pub fn pack_stat(size: u64, mtime: u64, ctime: u64, dev: u64, ino: u64, mode: u32) -> String {
    let size = size & 0xFFFFFFFF;
    let mtime = mtime & 0xFFFFFFFF;
    let ctime = ctime & 0xFFFFFFFF;
    let dev = dev & 0xFFFFFFFF;
    let ino = ino & 0xFFFFFFFF;

    let packed_data = [
        (size >> 24) as u8,
        (size >> 16) as u8,
        (size >> 8) as u8,
        size as u8,
        (mtime >> 24) as u8,
        (mtime >> 16) as u8,
        (mtime >> 8) as u8,
        mtime as u8,
        (ctime >> 24) as u8,
        (ctime >> 16) as u8,
        (ctime >> 8) as u8,
        ctime as u8,
        (dev >> 24) as u8,
        (dev >> 16) as u8,
        (dev >> 8) as u8,
        dev as u8,
        (ino >> 24) as u8,
        (ino >> 16) as u8,
        (ino >> 8) as u8,
        ino as u8,
        (mode >> 24) as u8,
        (mode >> 16) as u8,
        (mode >> 8) as u8,
        mode as u8,
    ];

    base64::engine::general_purpose::STANDARD_NO_PAD.encode(packed_data)
}

pub fn stat_to_minikind(metadata: &Metadata) -> char {
    let file_type = metadata.file_type();
    if file_type.is_dir() {
        'd'
    } else if file_type.is_file() {
        'f'
    } else if file_type.is_symlink() {
        'l'
    } else {
        panic!("Unsupported file type");
    }
}

pub const HEADER_FORMAT_2: &[u8] = b"#bazaar dirstate flat format 2\n";
pub const HEADER_FORMAT_3: &[u8] = b"#bazaar dirstate flat format 3\n";

#[derive(PartialEq, Eq, Debug)]
pub enum Kind {
    Absent,
    File,
    Directory,
    Relocated,
    Symlink,
    TreeReference,
}

impl Kind {
    pub fn to_char(&self) -> char {
        match self {
            Kind::Absent => 'a',
            Kind::File => 'f',
            Kind::Directory => 'd',
            Kind::Relocated => 'r',
            Kind::Symlink => 'l',
            Kind::TreeReference => 't',
        }
    }

    pub fn to_byte(&self) -> u8 {
        self.to_char() as u8
    }

    pub fn to_str(&self) -> &str {
        match self {
            Kind::Absent => "absent",
            Kind::File => "file",
            Kind::Directory => "directory",
            Kind::Relocated => "relocated",
            Kind::Symlink => "symlink",
            Kind::TreeReference => "tree-reference",
        }
    }
}

impl From<osutils::Kind> for Kind {
    fn from(k: osutils::Kind) -> Self {
        match k {
            osutils::Kind::File => Kind::File,
            osutils::Kind::Directory =>
                Kind::Directory,
            osutils::Kind::Symlink => Kind::Symlink,
            osutils::Kind::TreeReference => Kind::TreeReference,
        }
    }
}

impl ToString for Kind {
    fn to_string(&self) -> String {
        self.to_str().to_string()
    }
}

impl From<String> for Kind {
    fn from(s: String) -> Self {
        match s.as_str() {
            "absent" => Kind::Absent,
            "file" => Kind::File,
            "directory" => Kind::Directory,
            "relocated" => Kind::Relocated,
            "symlink" => Kind::Symlink,
            "tree-reference" => Kind::TreeReference,
            _ => panic!("Unknown kind: {}", s),
        }
    }
}

impl From<char> for Kind {
    fn from(c: char) -> Self {
        match c {
            'a' => Kind::Absent,
            'f' => Kind::File,
            'd' => Kind::Directory,
            'r' => Kind::Relocated,
            'l' => Kind::Symlink,
            't' => Kind::TreeReference,
            _ => panic!("Unknown kind: {}", c),
        }
    }
}

pub enum YesNo {
    Yes,
    No,
}

/// _header_state and _dirblock_state represent the current state
/// of the dirstate metadata and the per-row data respectively.
/// In future we will add more granularity, for instance _dirblock_state
/// will probably support partially-in-memory as a separate variable,
/// allowing for partially-in-memory unmodified and partially-in-memory
/// modified states.
#[derive(PartialEq, Eq, Debug)]
pub enum MemoryState {
    /// indicates that no data is in memory
    NotInMemory,

    /// indicates that what we have in memory is the same as is on disk
    InMemoryUnmodified,

    /// indicates that we have a modified version of what is on disk.
    InMemoryModified,
    InMemoryHashModified,
}

pub fn fields_per_entry(num_present_parents: usize) -> usize {
    // How many null separated fields should be in each entry row.
    //
    // Each line now has an extra '\n' field which is not used
    // so we just skip over it
    //
    // entry size:
    //  3 fields for the key
    //  + number of fields per tree_data (5) * tree count
    //  + newline
    let tree_count = 1 + num_present_parents;
    3 + 5 * tree_count + 1
}

pub fn get_ghosts_line(ghost_ids: &[&[u8]]) -> Vec<u8> {
    // Create a line for the state file for ghost information.
    let mut entries = Vec::new();
    let l = format!("{}", ghost_ids.len());
    entries.push(l.as_bytes());
    entries.extend_from_slice(ghost_ids);
    entries.join(&b"\0"[..])
}

pub fn get_parents_line(parent_ids: &[&[u8]]) -> Vec<u8> {
    // Create a line for the state file for parents information.
    let mut entries = Vec::new();
    let l = format!("{}", parent_ids.len());
    entries.push(l.as_bytes());
    entries.extend_from_slice(parent_ids);
    entries.join(&b"\0"[..])
}

pub struct IdIndex {
    id_index: HashMap<FileId, Vec<(Vec<u8>, Vec<u8>, FileId)>>,
}

impl Default for IdIndex {
    fn default() -> Self {
        Self::new()
    }
}

impl IdIndex {
    pub fn new() -> Self {
        IdIndex {
            id_index: HashMap::new(),
        }
    }

    pub fn add(&mut self, entry_key: (&[u8], &[u8], &FileId)) {
        // Add this entry to the _id_index mapping.
        //
        // This code used to use a set for every entry in the id_index. However,
        // it is *rare* to have more than one entry. So a set is a large
        // overkill. And even when we do, we won't ever have more than the
        // number of parent trees. Which is still a small number (rarely >2). As
        // such, we use a simple vector, and do our own uniqueness checks. While
        // the 'contains' check is O(N), since N is nicely bounded it shouldn't ever
        // cause quadratic failure.
        let file_id = entry_key.2;
        let entry_keys = self.id_index.entry(file_id.clone()).or_default();
        entry_keys.push((entry_key.0.to_vec(), entry_key.1.to_vec(), file_id.clone()));
    }

    pub fn remove(&mut self, entry_key: (&[u8], &[u8], &FileId)) {
        // Remove this entry from the _id_index mapping.
        //
        // It is a programming error to call this when the entry_key is not
        // already present.
        let file_id = entry_key.2;
        let entry_keys = self.id_index.get_mut(file_id).unwrap();
        entry_keys.retain(|key| (key.0.as_slice(), key.1.as_slice(), &key.2) != entry_key);
    }

    pub fn get(&self, file_id: &FileId) -> Vec<(Vec<u8>, Vec<u8>, FileId)> {
        self.id_index
            .get(file_id)
            .map_or_else(Vec::new, |v| v.clone())
    }

    pub fn iter_all(&self) -> impl Iterator<Item = &(Vec<u8>, Vec<u8>, FileId)> {
        self.id_index.values().flatten()
    }

    pub fn file_ids(&self) -> impl Iterator<Item = &FileId> {
        self.id_index.keys()
    }
}

/// Convert an inventory entry (from a revision tree) to state details.
///
/// Args:
///   inv_entry: An inventory entry whose sha1 and link targets can be
///     relied upon, and which has a revision set.
/// Returns: A details tuple - the details for a single tree at a path id.
pub fn inv_entry_to_details(e: &InventoryEntry) -> (u8, Vec<u8>, u64, bool, Vec<u8>) {
    let minikind = Kind::from(e.kind()).to_byte();
    let tree_data = e
        .revision()
        .map_or_else(Vec::new, |r| r.as_bytes().to_vec());
    let (fingerprint, size, executable) = match e {
        InventoryEntry::Directory { .. } | InventoryEntry::Root { .. } => (Vec::new(), 0, false),
        InventoryEntry::File {
            text_sha1,
            text_size,
            executable,
            ..
        } => (
            text_sha1.as_ref().map_or_else(Vec::new, |f| f.to_vec()),
            text_size.unwrap_or(0),
            *executable,
        ),
        InventoryEntry::Link { symlink_target, .. } => (
            symlink_target
                .as_ref()
                .map_or_else(Vec::new, |f| f.as_bytes().to_vec()),
            0,
            false,
        ),
        InventoryEntry::TreeReference {
            reference_revision, ..
        } => (
            reference_revision
                .as_ref()
                .map_or_else(Vec::new, |f| f.as_bytes().to_vec()),
            0,
            false,
        ),
    };

    (minikind, fingerprint, size, executable, tree_data)
}

fn _crc32(bit: &[u8]) -> u32 {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(bit);
    hasher.finalize()
}

/// Format lines for final output.
///
/// Args:
///   lines: A sequence of lines containing the parents list and the path lines.
pub fn get_output_lines(mut lines: Vec<&[u8]>) -> Vec<Vec<u8>> {
    // Format lines for final output.
    let mut output_lines = vec![HEADER_FORMAT_3];
    lines.push(b"");
    let inventory_text = lines.join(&b"\0\n\0"[..]).to_vec();
    let crc32 = _crc32(inventory_text.as_slice());
    let crc32_line = format!("crc32: {}\n", crc32).into_bytes();
    output_lines.push(crc32_line.as_slice());
    let num_entries = lines.len() - 3;
    let num_entries_line = format!("num_entries: {}\n", num_entries).into_bytes();
    output_lines.push(num_entries_line.as_slice());
    output_lines.push(inventory_text.as_slice());
    output_lines.into_iter().map(|l| l.to_vec()).collect()
}
bzrformats_3.4.0.orig/crates/bazaar/src/filters.rs0000644000000000000000000000536315162074037017220 0ustar00
use osutils::sha::sha_chunks;
use std::fs::File;
use std::io::Error;
use std::io::Read;
use std::path::Path;

pub type ContentFilterProvider = dyn Fn(&Path, u64) -> Box<dyn ContentFilter> + Send + Sync;

pub trait ContentFilter {
    fn reader(
        &self,
        input: Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>,
    ) -> Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>;

    fn writer(
        &self,
        input: Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>,
    ) -> Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>;

    fn sha1_file(&self, path: &Path) -> Result<String, Error> {
        let mut file = File::open(path)?;
        let chunk_iter = std::iter::from_fn(move || {
            let mut buf = vec![0; 128 << 10];
            let bytes_read = file.read(&mut buf);
            if let Err(e) = bytes_read {
                return Some(Err(e));
            }
            let bytes_read = bytes_read.unwrap();
            if bytes_read == 0 {
                None
            } else {
                buf.truncate(bytes_read);
                Some(Ok(buf))
            }
        });

        let chunk_iter = self.reader(Box::new(chunk_iter));

        let mut err = None;

        let sha1 = sha_chunks(chunk_iter.filter_map(|r| {
            if let Err(e) = r {
                err = Some(e);
                None
            } else {
                Some(r.unwrap())
            }
        }));

        if let Some(err) = err {
            Err(err)
        } else {
            Ok(sha1)
        }
    }
}

pub struct ContentFilterStack {
    filters: Vec<Box<dyn ContentFilter>>,
}

impl From<Vec<Box<dyn ContentFilter>>> for ContentFilterStack {
    fn from(filters: Vec<Box<dyn ContentFilter>>) -> Self {
        Self { filters }
    }
}

impl ContentFilterStack {
    pub fn new() -> Self {
        Self {
            filters: Vec::new(),
        }
    }

    pub fn add_filter(&mut self, filter: Box<dyn ContentFilter>) {
        self.filters.push(filter);
    }
}

impl std::default::Default for ContentFilterStack {
    fn default() -> Self {
        Self::new()
    }
}

impl ContentFilter for ContentFilterStack {
    fn reader(
        &self,
        input: Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>,
    ) -> Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync> {
        self.filters
            .iter()
            .fold(input, |input, filter| filter.reader(input))
    }

    fn writer(
        &self,
        input: Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync>,
    ) -> Box<dyn Iterator<Item = Result<Vec<u8>, Error>> + Send + Sync> {
        self.filters
            .iter()
            .fold(input, |input, filter| filter.writer(input))
    }
}
bzrformats_3.4.0.orig/crates/bazaar/src/gen_ids.rs0000644000000000000000000001034615162074037017155 0ustar00
use osutils::rand_chars;

use lazy_regex::regex;
use lazy_static::lazy_static;
use regex::bytes::Regex;
use std::time::{SystemTime, UNIX_EPOCH};

lazy_static! {
    // the regex removes any weird characters; we don't escape them
    // but rather just pull them out
    static ref FILE_ID_CHARS_RE: Regex = Regex::new(r#"[^\w.]"#).unwrap();
    static ref REV_ID_CHARS_RE: Regex = Regex::new(r#"[^-\w.+@]"#).unwrap();
    static ref GEN_FILE_ID_SUFFIX: String = gen_file_id_suffix();
}

fn gen_file_id_suffix() -> String {
    let current_time = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    let random_chars = rand_chars(16);
    format!(
        "-{}-{}-",
        osutils::time::compact_date(current_time),
        random_chars
    )
}

pub fn next_id_suffix(suffix: Option<&str>) -> Vec<u8> {
    static GEN_FILE_ID_SERIAL: std::sync::atomic::AtomicUsize =
        std::sync::atomic::AtomicUsize::new(0);

    // XXX TODO: change breezy.add.smart_add_tree to call workingtree.add() rather
    // than having to move the id randomness out of the inner loop like this.
    // XXX TODO: for the global randomness this uses we should add the thread-id
    // before the serial #.
    // XXX TODO: jam 20061102 I think it would be good to reset every 100 or
    //           1000 calls, or perhaps if time.time() increases by a certain
    //           amount. time.time() shouldn't be terribly expensive to call,
    //           and it means that long-lived processes wouldn't use the same
    //           suffix forever.
    let serial = GEN_FILE_ID_SERIAL.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
    format!(
        "{}{}",
        suffix.unwrap_or(GEN_FILE_ID_SUFFIX.as_str()),
        serial
    )
    .into_bytes()
}

pub fn gen_file_id(name: &str) -> Vec<u8> {
    // The real randomness is in the _next_id_suffix, the
    // rest of the identifier is just to be nice.
    // So we:
    // 1) Remove non-ascii word characters to keep the ids portable
    // 2) squash to lowercase, so the file id doesn't have to
    //    be escaped (case insensitive filesystems would bork for ids
    //    that only differ in case without escaping).
    // 3) truncate the filename to 20 chars. Long filenames also bork on some
    //    filesystems
    // 4) Remove leading '.' characters to prevent the file ids from
    //    being considered hidden.
    let name_bytes = name
        .chars()
        .filter(|c| c.is_ascii())
        .collect::<String>()
        .to_ascii_lowercase()
        .as_bytes()
        .to_vec();
    let ascii_word_only = FILE_ID_CHARS_RE
        .replace_all(&name_bytes, |_: &regex::bytes::Captures| b"")
        .to_vec();
    let without_dots = ascii_word_only
        .into_iter()
        .skip_while(|c| *c == b'.')
        .collect::<Vec<u8>>();
    let short = without_dots.iter().take(20).cloned().collect::<Vec<u8>>();
    let suffix = next_id_suffix(None);
    [short, suffix].concat()
}

pub fn gen_root_id() -> Vec<u8> {
    gen_file_id("tree_root")
}

fn get_identifier(s: &str) -> Vec<u8> {
    let mut identifier = s.to_string();
    if let Some(start) = s.find('<') {
        let end = s.rfind('>');
        if end.is_some()
            && start < end.unwrap()
            && end.unwrap() == s.len() - 1
            && s[start..].find('@').is_some()
        {
            identifier = s[start + 1..end.unwrap()].to_string();
        }
    }
    let identifier: String = identifier
        .to_ascii_lowercase()
        .replace(' ', "_")
        .chars()
        .filter(|c| c.is_ascii())
        .collect();
    REV_ID_CHARS_RE
        .replace_all(identifier.as_bytes(), |_: &regex::bytes::Captures| b"")
        .to_vec()
}

pub fn gen_revision_id(username: &str, timestamp: Option<u64>) -> Vec<u8> {
    let user_or_email = get_identifier(username);
    // This gives 36^16 ~= 2^82.7 ~= 83 bits of entropy
    let unique_chunk = osutils::rand_chars(16).as_bytes().to_vec();
    let timestamp =
timestamp.unwrap_or_else(|| { SystemTime::now() .duration_since(UNIX_EPOCH) .unwrap() .as_secs() }); [ user_or_email, osutils::time::compact_date(timestamp) .as_bytes() .to_vec(), unique_chunk, ] .join(&b'-') } bzrformats_3.4.0.orig/crates/bazaar/src/globbing.rs0000644000000000000000000001023215162074037017322 0ustar00//! Tools for converting globs to regular expressions. //! //! This module provides functions for converting shell-like globs to regular //! expressions. pub use fancy_regex::{Captures, Error, Match, Regex}; use lazy_static::lazy_static; use std::sync::Arc; lazy_static! { static ref SLASHES_RE: Regex = Regex::new(r"[\\/]+").unwrap(); static ref EXPAND_RE: Regex = Regex::new("\\\\&").unwrap(); } /// Converts backslashes in path patterns to forward slashes. /// Doesn't normalize regular expressions - they may contain escapes. pub fn normalize_pattern(pattern: &str) -> String { let mut pattern = pattern.to_string(); if !(pattern.starts_with("RE:") || pattern.starts_with("!RE:")) { pattern = SLASHES_RE.replace_all(pattern.as_str(), "/").to_string(); } if pattern.len() > 1 { pattern = pattern.trim_end_matches('/').to_string(); } pattern } pub enum Replacement { String(String), Function(fn(&str) -> String), Closure(Box String + Sync + Send>), } // TODO(jelmer): Consider using RegexSet from the regex crate instead. /// Do a multiple-pattern substitution. /// /// The patterns and substitutions are combined into one, so the result of /// one replacement is never substituted again. Add the patterns and /// replacements via the add method and then call the object. The patterns /// must not contain capturing groups. pub struct Replacer { compiled: Option, pats: Vec<(String, Arc)>, } impl Replacer { pub fn new(source: Option<&Self>) -> Self { let mut ret = Self::empty(); if let Some(source) = source { ret.add_replacer(source); } ret } pub fn empty() -> Self { Self { compiled: None, pats: Vec::new(), } } /// Add a pattern and replacement. 
    ///
    /// The pattern must not contain capturing groups.
    /// The replacement may be either a string template, in which `\&` is
    /// replaced with the match, or a function that receives the matching text
    /// as its argument. It does not receive the match object, because
    /// capturing is forbidden anyway.
    pub fn add(&mut self, pat: &str, fun: Replacement) {
        // Need to recompile
        self.compiled = None;
        self.pats.push((pat.to_string(), Arc::new(fun)));
    }

    pub fn add_validate(&mut self, pat: &str, fun: Replacement) -> Result<(), Error> {
        Regex::new(pat)?;
        self.add(pat, fun);
        Ok(())
    }

    /// Add all patterns from another replacer.
    ///
    /// All patterns and replacements from replacer are appended to the ones
    /// already defined.
    pub fn add_replacer(&mut self, replacer: &Replacer) {
        self.compiled = None;
        self.pats.extend(replacer.pats.clone());
    }

    pub fn replace(&mut self, text: &str) -> std::result::Result<String, Error> {
        if self.pats.is_empty() {
            return Ok(text.to_string());
        }
        if self.compiled.is_none() {
            let pat_str = self
                .pats
                .iter()
                .map(|(pat, _)| format!("({})", pat))
                .collect::<Vec<_>>()
                .join("|");
            self.compiled = Some(Regex::new(&pat_str)?);
        }
        let pats = &mut self.pats;
        fn expand(text: &str, rep: &str) -> String {
            rep.replace("\\&", text)
        }
        fn sub(m: &Match, rep: &mut Arc<Replacement>) -> String {
            let replacement = Arc::get_mut(rep).unwrap();
            match replacement {
                Replacement::String(s) => expand(m.as_str(), s.as_str()),
                Replacement::Function(f) => f(m.as_str()),
                Replacement::Closure(f) => f(m.as_str().to_string()),
            }
        }
        Ok(self
            .compiled
            .as_ref()
            .unwrap()
            .replace_all(text, |caps: &Captures| {
                // Each pattern was wrapped in its own group, so the first
                // group that matched identifies which replacement to apply.
                for (index, m) in caps.iter().skip(1).enumerate() {
                    if let Some(m) = m {
                        return sub(&m, &mut pats[index].1);
                    }
                }
                unreachable!();
            })
            .to_string())
    }
}
bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/0000755000000000000000000000000015162074037020103 5ustar00bzrformats_3.4.0.orig/crates/bazaar/src/hashcache.rs0000644000000000000000000003661715162210136017455 0ustar00use crate::filters::{ContentFilter,
ContentFilterProvider, ContentFilterStack}; use osutils::sha::sha_string; use log::{debug, info}; use std::collections::HashMap; use std::fs; use std::fs::{File, Metadata, Permissions}; use std::io; use std::io::prelude::*; use std::io::BufReader; #[cfg(unix)] use std::os::unix::fs::MetadataExt; use std::path::{Path, PathBuf}; use std::time::{SystemTime, UNIX_EPOCH}; use tempfile::NamedTempFile; /// TODO: Up-front, stat all files in order and remove those which are deleted or /// out-of-date. Don't actually re-read them until they're needed. That ought /// to bring all the inodes into core so that future stats to them are fast, and /// it preserves the nice property that any caller will always get up-to-date /// data except in unavoidable cases. /// TODO: Perhaps return more details on the file to avoid statting it /// again: nonexistent, file type, size, etc const CACHE_HEADER: &[u8] = b"### bzr hashcache v5\n"; enum FileKind { Regular, Symlink, Other, } fn file_kind(path: &Path) -> FileKind { match fs::symlink_metadata(path) { Ok(meta) => { if meta.is_symlink() { FileKind::Symlink } else if meta.is_file() { FileKind::Regular } else { FileKind::Other } } Err(_) => FileKind::Other, } } /// Cache for looking up file SHA-1. /// /// Files are considered to match the cached value if the fingerprint /// of the file has not changed. This includes its mtime, ctime, /// device number, inode number, and size. This should catch /// modifications or replacement of the file by a new one. /// /// This may not catch modifications that do not change the file's /// size and that occur within the resolution window of the /// timestamps. To handle this we specifically do not cache files /// which have changed since the start of the present second, since /// they could undetectably change again. /// /// This scheme may fail if the machine's clock steps backwards. /// Don't do that. /// /// This does not canonicalize the paths passed in; that should be /// done by the caller. 
/// /// _cache /// Indexed by path, points to a two-tuple of the SHA-1 of the file. /// and its fingerprint. /// /// stat_count /// number of times files have been statted /// /// hit_count /// number of times files have been retrieved from the cache, avoiding a /// re-read /// /// miss_count /// number of misses (times files have been completely re-read) #[derive(Debug, PartialEq, Default, Clone)] pub struct Fingerprint { pub size: u64, pub mtime: i64, pub ctime: i64, pub ino: u64, pub dev: u64, pub mode: u32, } #[cfg(unix)] impl From for Fingerprint { fn from(meta: Metadata) -> Fingerprint { Fingerprint { size: meta.size(), mtime: meta.mtime(), ctime: meta.ctime(), ino: meta.ino(), dev: meta.dev(), mode: meta.mode(), } } } #[cfg(windows)] impl From for Fingerprint { fn from(meta: Metadata) -> Fingerprint { use std::os::windows::fs::MetadataExt; let mtime = meta .modified() .ok() .and_then(|t| t.duration_since(UNIX_EPOCH).ok()) .map(|d| d.as_secs() as i64) .unwrap_or(0); let ctime = meta .created() .ok() .and_then(|t| t.duration_since(UNIX_EPOCH).ok()) .map(|d| d.as_secs() as i64) .unwrap_or(0); Fingerprint { size: meta.file_size(), mtime, ctime, ino: 0, dev: 0, mode: 0, } } } const DEFAULT_CUTOFF_OFFSET: i64 = -3; pub struct HashCache { root: PathBuf, hit_count: u32, miss_count: u32, stat_count: u32, danger_count: u32, removed_count: u32, update_count: u32, cache: HashMap, needs_write: bool, permissions: Option, cache_file_name: PathBuf, filter_provider: Option>, cutoff_offset: i64, } impl HashCache { /// Create a hash cache in base dir, and set the file mode to mode. /// /// Args: /// content_filter_provider: a function that takes a /// path (relative to the top of the tree) and a file-id as /// parameters and returns a stack of ContentFilters. /// If None, no content filtering is performed. 
pub fn new( root: &Path, cache_file_name: &Path, permissions: Option, content_filter_provider: Option>, ) -> Self { HashCache { root: root.to_path_buf(), hit_count: 0, miss_count: 0, stat_count: 0, danger_count: 0, removed_count: 0, update_count: 0, cache: HashMap::new(), needs_write: false, permissions, cache_file_name: cache_file_name.to_path_buf(), filter_provider: content_filter_provider, cutoff_offset: DEFAULT_CUTOFF_OFFSET, } } pub fn cache_file_name(&self) -> &Path { self.cache_file_name.as_path() } pub fn hit_count(&self) -> u32 { self.hit_count } pub fn miss_count(&self) -> u32 { self.miss_count } pub fn set_cutoff_offset(&mut self, offset: i64) { self.cutoff_offset = offset; } /// Discard all cached information. /// /// This does not reset the counters. pub fn clear(&mut self) { if !self.cache.is_empty() { self.needs_write = true; self.cache.clear(); } } /// Scan all files and remove entries where the cache entry is obsolete. /// /// Obsolete entries are those where the file has been modified or deleted /// since the entry was inserted. 
    pub fn scan(&mut self) {
        let mut keys_to_remove = Vec::new();
        let mut by_inode = self
            .cache
            .iter()
            .map(|(k, v)| (v.1.ino, k, v))
            .collect::<Vec<_>>();
        by_inode.sort_by_key(|x| x.0);
        for (_inode, path, cache_val) in by_inode {
            let abspath = self.root.join(path);
            let fp = self.fingerprint(abspath.as_ref(), None);
            self.stat_count += 1;
            if fp.is_none() || cache_val.1 != fp.unwrap() {
                // not here or not a regular file anymore
                self.removed_count += 1;
                self.needs_write = true;
                keys_to_remove.push(path.clone());
            }
        }
        for path in keys_to_remove {
            self.cache.remove(&path);
        }
    }

    pub fn get_sha1_by_fingerprint(
        &mut self,
        path: &Path,
        file_fp: &Fingerprint,
    ) -> io::Result<String> {
        let abspath = self.root.join(path);
        let (cache_sha1, cache_fp) = self
            .cache
            .get(path)
            .cloned()
            .unwrap_or((Default::default(), Default::default()));
        if cache_fp == *file_fp {
            self.hit_count += 1;
            Ok(cache_sha1)
        } else {
            self.miss_count += 1;
            match file_kind(&abspath) {
                FileKind::Regular => {
                    let filters: Box<dyn ContentFilter> =
                        if let Some(filter_provider) = self.filter_provider.as_ref() {
                            filter_provider(path, file_fp.ctime as u64)
                        } else {
                            Box::new(ContentFilterStack::new())
                        };
                    let digest = filters.sha1_file(&abspath)?;
                    // window of 3 seconds to allow for 2s resolution on windows,
                    // unsynchronized file servers, etc.
                    let cutoff = self.cutoff_time();
                    if file_fp.mtime >= cutoff || file_fp.ctime >= cutoff {
                        // changed too recently; can't be cached. we can
                        // return the result and it could possibly be cached
                        // next time.
                        //
                        // the point is that we only want to cache when we are sure that any
                        // subsequent modifications of the file can be detected. If a
                        // modification neither changes the inode, the device, the size, nor
                        // the mode, then we can only distinguish it by time; therefore we
                        // need to let sufficient time elapse before we may cache this entry
                        // again. If we didn't do this, then, for example, a very quick 1
                        // byte replacement in the file might go undetected.
                        self.danger_count += 1;
                        if self.cache.remove(path).is_some() {
                            self.removed_count += 1;
                            self.needs_write = true;
                        }
                    } else {
                        self.update_count += 1;
                        self.needs_write = true;
                        self.cache
                            .insert(path.to_owned(), (digest.clone(), file_fp.clone()));
                    }
                    Ok(digest)
                }
                FileKind::Symlink => {
                    let target = fs::read_link(&abspath)?;
                    let digest = sha_string(target.to_string_lossy().as_bytes());
                    self.cache
                        .insert(path.to_owned(), (digest.clone(), file_fp.clone()));
                    self.update_count += 1;
                    self.needs_write = true;
                    Ok(digest)
                }
                _ => Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("unknown file stat mode: {:o}", file_fp.mode),
                )),
            }
        }
    }

    /// Return the SHA-1 of the file at path.
    pub fn get_sha1(
        &mut self,
        path: &Path,
        stat_value: Option<Metadata>,
    ) -> io::Result<Option<String>> {
        let abspath = self.root.join(path);
        self.stat_count += 1;
        let file_fp = self.fingerprint(abspath.as_ref(), stat_value);
        if file_fp.is_none() {
            // not a regular file or not existing
            if self.cache.remove(path).is_some() {
                self.removed_count += 1;
                self.needs_write = true;
            }
            Ok(None)
        } else {
            Ok(Some(self.get_sha1_by_fingerprint(path, &file_fp.unwrap())?))
        }
    }

    /// Write contents of cache to file.
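    ///
    /// # Example (illustrative sketch)
    ///
    /// Each record is written as one line: the path, the marker `"// "`,
    /// the SHA-1, then the six fingerprint fields separated by spaces.
    /// `format_record` below is a hypothetical stand-alone helper, not part
    /// of this crate; it only mirrors the layout produced by this method.
    ///
    /// ```rust
    /// // Hypothetical helper; the tuple follows the `Fingerprint` field
    /// // order: size, mtime, ctime, ino, dev, mode.
    /// fn format_record(path: &str, sha1: &str, fp: (u64, i64, i64, u64, u64, u32)) -> String {
    ///     format!(
    ///         "{}// {} {} {} {} {} {} {}\n",
    ///         path, sha1, fp.0, fp.1, fp.2, fp.3, fp.4, fp.5
    ///     )
    /// }
    ///
    /// fn main() {
    ///     let sha = "a".repeat(40);
    ///     let line = format_record("src/lib.rs", &sha, (12, 1_700_000_000, 1_700_000_000, 42, 2049, 0o100644));
    ///     assert!(line.starts_with("src/lib.rs// aaaa"));
    ///     assert!(line.ends_with('\n'));
    /// }
    /// ```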
    pub fn write(&mut self) -> Result<(), std::io::Error> {
        let mut outf = NamedTempFile::new_in(self.cache_file_name.parent().unwrap())?;
        if let Some(permissions) = self.permissions.clone() {
            outf.as_file().set_permissions(permissions)?;
        }
        outf.write_all(CACHE_HEADER)?;
        for (path, c) in &self.cache {
            let mut line_info: Vec<u8> = Vec::new();
            line_info.extend_from_slice(path.to_str().unwrap().as_bytes());
            line_info.extend_from_slice(b"// ");
            line_info.extend_from_slice(c.0.as_bytes());
            line_info.push(b' ');
            let fp = &c.1;
            write!(
                &mut line_info,
                "{} {} {} {} {} {}",
                fp.size, fp.mtime, fp.ctime, fp.ino, fp.dev, fp.mode
            )?;
            line_info.push(b'\n');
            outf.write_all(&line_info)?;
        }
        outf.persist(self.cache_file_name())?;
        self.needs_write = false;
        debug!(
            "write hash cache: {} hits={} misses={} stat={} recent={} updates={}",
            self.cache_file_name().display(),
            self.hit_count,
            self.miss_count,
            self.stat_count,
            self.danger_count,
            self.update_count
        );
        Ok(())
    }

    /// Reinstate cache from file.
    ///
    /// Overwrites existing cache.
    ///
    /// If the cache file has the wrong version marker, this just clears
    /// the cache.
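    ///
    /// # Example (illustrative sketch)
    ///
    /// A record line is split at the first `"// "` marker; what follows must
    /// be a 40-character SHA-1 and six numeric fields. `Record` and
    /// `parse_record` are hypothetical names used only for this sketch, not
    /// this crate's API. Unlike this method, the sketch returns `None` for
    /// any malformed line instead of logging and skipping it.
    ///
    /// ```rust
    /// #[derive(Debug, PartialEq)]
    /// struct Record {
    ///     path: String,
    ///     sha1: String,
    ///     size: u64,
    ///     mtime: i64,
    ///     ctime: i64,
    ///     ino: u64,
    ///     dev: u64,
    ///     mode: u32,
    /// }
    ///
    /// fn parse_record(line: &str) -> Option<Record> {
    ///     let pos = line.find("// ")?;
    ///     let fields: Vec<&str> = line[pos + 3..].split(' ').collect();
    ///     // One SHA-1 plus six fingerprint fields.
    ///     if fields.len() != 7 || fields[0].len() != 40 {
    ///         return None;
    ///     }
    ///     Some(Record {
    ///         path: line[..pos].to_string(),
    ///         sha1: fields[0].to_string(),
    ///         size: fields[1].parse().ok()?,
    ///         mtime: fields[2].parse().ok()?,
    ///         ctime: fields[3].parse().ok()?,
    ///         ino: fields[4].parse().ok()?,
    ///         dev: fields[5].parse().ok()?,
    ///         mode: fields[6].parse().ok()?,
    ///     })
    /// }
    ///
    /// fn main() {
    ///     let sha = "0".repeat(40);
    ///     let rec = parse_record(&format!("a.txt// {} 1 2 3 4 5 6", sha)).unwrap();
    ///     assert_eq!(rec.path, "a.txt");
    ///     assert_eq!(rec.mode, 6);
    ///     assert!(parse_record("no marker here").is_none());
    /// }
    /// ```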
    pub fn read(&mut self) -> Result<(), std::io::Error> {
        self.cache = HashMap::new();
        let file = File::open(self.cache_file_name());
        if file.is_err() {
            debug!(
                "failed to open {}: {}",
                self.cache_file_name().display(),
                file.err().unwrap()
            );
            self.needs_write = true;
            return Ok(());
        }
        let file = file.unwrap();
        let reader = BufReader::with_capacity(65000, file);
        let mut lines = reader.lines();
        if let Some(header) = lines.next() {
            // `lines()` strips the trailing newline, so compare against the
            // header without its final "\n". A mismatch means the marker is
            // missing or from another version, and the cache is discarded.
            let expected = CACHE_HEADER.strip_suffix(b"\n").unwrap_or(CACHE_HEADER);
            if header?.as_bytes() != expected {
                self.needs_write = true;
                return Err(std::io::Error::new(
                    std::io::ErrorKind::InvalidData,
                    format!(
                        "cache header marker not found at top of {}; discarding cache",
                        self.cache_file_name().display()
                    ),
                ));
            }
        } else {
            self.needs_write = true;
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "error reading cache file header".to_string(),
            ));
        }
        for line in lines {
            let line = line?;
            let pos = line.find("// ").unwrap();
            let path = PathBuf::from(&line[..pos]);
            if self.cache.contains_key(&path) {
                info!("duplicated path {} in cache", path.display());
                continue;
            }
            let pos = pos + 3;
            let fields = line[pos..].split(' ').collect::<Vec<_>>();
            if fields.len() != 7 {
                info!("bad line in hashcache: {}", line);
                continue;
            }
            let sha1 = fields[0].to_owned();
            if sha1.len() != 40 {
                info!("bad sha1 in hashcache: {}", sha1);
                continue;
            }
            let fp = Fingerprint {
                size: fields[1].parse::<u64>().unwrap(),
                mtime: fields[2].parse::<i64>().unwrap(),
                ctime: fields[3].parse::<i64>().unwrap(),
                ino: fields[4].parse::<u64>().unwrap(),
                dev: fields[5].parse::<u64>().unwrap(),
                mode: fields[6].parse::<u32>().unwrap(),
            };
            self.cache.insert(path, (sha1, fp));
        }
        self.needs_write = false;
        Ok(())
    }

    pub fn needs_write(&self) -> bool {
        self.needs_write
    }

    /// Return cutoff time.
    ///
    /// Files modified more recently than this time are at risk of being
    /// undetectably modified and so can't be cached.
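    ///
    /// # Example (illustrative sketch)
    ///
    /// The rule that `get_sha1_by_fingerprint` applies with this cutoff can
    /// be restated as a pure function. `is_cacheable` is a hypothetical name
    /// that does not exist in this crate, and `now` is passed in rather than
    /// read from the clock so that the example is deterministic.
    ///
    /// ```rust
    /// fn is_cacheable(mtime: i64, ctime: i64, now: i64, cutoff_offset: i64) -> bool {
    ///     // The offset is negative (the default is -3 seconds).
    ///     let cutoff = now + cutoff_offset;
    ///     mtime < cutoff && ctime < cutoff
    /// }
    ///
    /// fn main() {
    ///     // Touched 2 seconds ago: too recent to cache with a -3 offset.
    ///     assert!(!is_cacheable(998, 998, 1000, -3));
    ///     // Touched 10 seconds ago: safe to cache.
    ///     assert!(is_cacheable(990, 990, 1000, -3));
    /// }
    /// ```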
pub fn cutoff_time(&self) -> i64 { SystemTime::now() .duration_since(UNIX_EPOCH) .unwrap() .as_secs() as i64 + self.cutoff_offset } pub fn fingerprint(&self, abspath: &Path, stat_value: Option) -> Option { let stat_value = match stat_value { Some(s) => s, None => match fs::symlink_metadata(abspath) { Ok(s) => s, Err(_) => return None, }, }; if stat_value.is_dir() { return None; } Some(stat_value.into()) } } bzrformats_3.4.0.orig/crates/bazaar/src/inventory.rs0000644000000000000000000014330715162074037017606 0ustar00use crate::inventory_delta::{InventoryDelta, InventoryDeltaEntry, InventoryDeltaInconsistency}; use crate::{FileId, RevisionId}; use osutils::Kind; use std::collections::HashMap; use std::collections::HashSet; use std::collections::VecDeque; use std::hash::Hash; // This should really be an id randomly assigned when the tree is // created, but it's not for now. pub const ROOT_ID: &[u8] = b"TREE_ROOT"; pub fn versionable_kind(kind: Kind) -> bool { // Check if a kind is versionable matches!( kind, Kind::File | Kind::Directory | Kind::Symlink | Kind::TreeReference ) } #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)] pub enum Entry { Root { file_id: FileId, revision: Option, }, Directory { file_id: FileId, revision: Option, parent_id: FileId, name: String, }, File { file_id: FileId, revision: Option, parent_id: FileId, name: String, text_sha1: Option>, text_size: Option, text_id: Option>, executable: bool, }, Link { file_id: FileId, name: String, parent_id: FileId, symlink_target: Option, revision: Option, }, TreeReference { file_id: FileId, revision: Option, reference_revision: Option, name: String, parent_id: FileId, }, } #[derive(Debug)] pub enum Error { InvalidEntryName(String), DuplicateFileId(FileId, String), ParentNotDirectory(String, FileId), FileIdCycle(FileId, String, String), NoSuchId(FileId), ParentMissing(FileId), PathAlreadyVersioned(String, String), ParentNotVersioned(String), InvalidNormalization(std::path::PathBuf, String), } /// 
Description of a versioned file. /// /// An InventoryEntry has the following fields, which are also /// present in the XML inventory-entry element: /// /// file_id /// /// name /// (within the parent directory) /// /// parent_id /// file_id of the parent directory, or ROOT_ID /// /// revision /// the revision_id in which this variation of this file was /// introduced. /// /// executable /// Indicates that this file should be executable on systems /// that support it. /// /// text_sha1 /// sha-1 of the text of the file /// /// text_size /// size in bytes of the text of the file /// /// (reading a version 4 tree created a text_id field.) impl Entry { /// Return true if the object this entry represents has textual data. /// /// Note that textual data includes binary content. /// /// Also note that all entries get weave files created for them. /// This attribute is primarily used when upgrading from old trees that /// did not have the weave index for all inventory entries. pub fn has_text(&self) -> bool { match self { Entry::Directory { .. } => false, Entry::File { .. } => true, Entry::Link { .. } => false, Entry::TreeReference { .. } => false, Entry::Root { .. } => false, } } pub fn kind(&self) -> Kind { match self { Entry::Directory { .. } => Kind::Directory, Entry::File { .. } => Kind::File, Entry::Link { .. } => Kind::Symlink, Entry::TreeReference { .. } => Kind::TreeReference, Entry::Root { .. 
} => Kind::Directory, } } pub fn directory( file_id: FileId, name: String, parent_id: FileId, revision: Option, ) -> Self { Self::Directory { file_id, revision, parent_id, name, } } pub fn root(file_id: FileId, revision: Option) -> Self { Entry::Root { file_id, revision } } pub fn file( file_id: FileId, name: String, parent_id: FileId, revision: Option, text_sha1: Option>, text_size: Option, executable: Option, text_id: Option>, ) -> Self { let executable = executable.unwrap_or(false); Entry::File { file_id, name, parent_id, revision, text_sha1, text_size, text_id, executable, } } pub fn tree_reference( file_id: FileId, name: String, parent_id: FileId, revision: Option, reference_revision: Option, ) -> Self { Entry::TreeReference { file_id, revision, reference_revision, name, parent_id, } } pub fn link( file_id: FileId, name: String, parent_id: FileId, revision: Option, symlink_target: Option, ) -> Self { Entry::Link { file_id, name, parent_id, symlink_target, revision, } } pub fn file_id(&self) -> &FileId { match self { Entry::Directory { file_id, .. } => file_id, Entry::File { file_id, .. } => file_id, Entry::Link { file_id, .. } => file_id, Entry::TreeReference { file_id, .. } => file_id, Entry::Root { file_id, .. } => file_id, } } pub fn set_file_id(&mut self, new_file_id: FileId) { match self { Entry::Directory { file_id, .. } => { *file_id = new_file_id; } Entry::File { file_id, .. } => { *file_id = new_file_id; } Entry::Link { file_id, .. } => { *file_id = new_file_id; } Entry::TreeReference { file_id, .. } => { *file_id = new_file_id; } Entry::Root { file_id, .. } => { *file_id = new_file_id; } } } pub fn parent_id(&self) -> Option<&FileId> { match self { Entry::Directory { parent_id, .. } => Some(parent_id), Entry::File { parent_id, .. } => Some(parent_id), Entry::Link { parent_id, .. } => Some(parent_id), Entry::TreeReference { parent_id, .. } => Some(parent_id), Entry::Root { .. 
} => None, } } pub fn set_parent_id(&mut self, new_parent_id: Option) { match self { Entry::Root { .. } => { if new_parent_id.is_some() { panic!("Cannot set parent_id on root"); } } Entry::Directory { parent_id, .. } => { *parent_id = new_parent_id.unwrap(); } Entry::File { parent_id, .. } => { *parent_id = new_parent_id.unwrap(); } Entry::Link { parent_id, .. } => { *parent_id = new_parent_id.unwrap(); } Entry::TreeReference { parent_id, .. } => { *parent_id = new_parent_id.unwrap(); } } } pub fn name(&self) -> &str { match self { Entry::Directory { name, .. } => name, Entry::File { name, .. } => name, Entry::Link { name, .. } => name, Entry::TreeReference { name, .. } => name, Entry::Root { .. } => "", } } pub fn set_name(&mut self, new_name: String) { match self { Entry::Directory { name, .. } => { *name = new_name; } Entry::File { name, .. } => { *name = new_name; } Entry::Link { name, .. } => { *name = new_name; } Entry::TreeReference { name, .. } => { *name = new_name; } Entry::Root { .. } => { panic!("Cannot set name on root"); } } } pub fn revision(&self) -> Option<&RevisionId> { match self { Entry::Directory { revision, .. } => revision.as_ref(), Entry::File { revision, .. } => revision.as_ref(), Entry::Link { revision, .. } => revision.as_ref(), Entry::TreeReference { revision, .. } => revision.as_ref(), Entry::Root { revision, .. } => revision.as_ref(), } } pub fn symlink_target(&self) -> Option<&str> { match self { Entry::Directory { .. } => None, Entry::File { .. } => None, Entry::Link { symlink_target, .. } => symlink_target.as_ref().map(|s| s.as_str()), Entry::TreeReference { .. } => None, Entry::Root { .. 
} => None, } } pub fn is_unmodified(&self, other: &Entry) -> bool { let other_revision = other.revision(); if other_revision.is_none() { return false; } self.revision() == other_revision } pub fn unchanged(&self, other: &Entry) -> bool { let mut compatible = true; // different inv parent if self.parent_id() != other.parent_id() || self.name() != other.name() || self.kind() != other.kind() { compatible = false; } match (self, other) { ( Entry::File { text_sha1: this_text_sha1, text_size: this_text_size, executable: this_executable, .. }, Entry::File { text_sha1: other_text_sha1, text_size: other_text_size, executable: other_executable, .. }, ) => { if this_text_sha1 != other_text_sha1 { compatible = false; } if this_text_size != other_text_size { compatible = false; } if this_executable != other_executable { compatible = false; } } ( Entry::Link { symlink_target: this_symlink_target, .. }, Entry::Link { symlink_target: other_symlink_target, .. }, ) => { if this_symlink_target != other_symlink_target { compatible = false; } } ( Entry::TreeReference { reference_revision: this_reference_revision, .. }, Entry::TreeReference { reference_revision: other_reference_revision, .. }, ) => { if this_reference_revision != other_reference_revision { compatible = false; } } _ => {} } compatible } } pub enum EntryChange { Unchanged, Added, Removed, Renamed, Modified, ModifiedAndRenamed, } impl ToString for EntryChange { fn to_string(&self) -> String { match self { EntryChange::Unchanged => "unchanged".to_string(), EntryChange::Added => "added".to_string(), EntryChange::Removed => "removed".to_string(), EntryChange::Renamed => "renamed".to_string(), EntryChange::Modified => "modified".to_string(), EntryChange::ModifiedAndRenamed => "modified and renamed".to_string(), } } } /// Describe the change between old_entry and this. /// /// This smells of being an InterInventoryEntry situation, but as its /// the first one, we're making it a static method for now. 
/// /// An entry with a different parent, or different name is considered /// to be renamed. Reparenting is an internal detail. /// Note that renaming the parent does not trigger a rename for the /// child entry itself. pub fn describe_change(old_entry: Option<&Entry>, new_entry: Option<&Entry>) -> EntryChange { if old_entry == new_entry { return EntryChange::Unchanged; } else if old_entry.is_none() { return EntryChange::Added; } else if new_entry.is_none() { return EntryChange::Removed; } let old_entry = old_entry.unwrap(); let new_entry = new_entry.unwrap(); if old_entry.kind() != new_entry.kind() { return EntryChange::Modified; } let (text_modified, meta_modified) = detect_changes(old_entry, new_entry); let modified = text_modified || meta_modified; // TODO 20060511 (mbp, rbc) factor out 'detect_rename' here. let renamed = if old_entry.parent_id() != new_entry.parent_id() { true } else { old_entry.name() != new_entry.name() }; if renamed && !modified { return EntryChange::Renamed; } if modified && !renamed { return EntryChange::Modified; } if modified && renamed { return EntryChange::ModifiedAndRenamed; } EntryChange::Unchanged } pub fn detect_changes(old_entry: &Entry, new_entry: &Entry) -> (bool, bool) { match new_entry { Entry::Link { symlink_target: new_symlink_target, .. } => match old_entry { Entry::Link { symlink_target: old_symlink_target, .. } => (old_symlink_target != new_symlink_target, false), _ => panic!("old_entry is not a link"), }, Entry::File { text_sha1: new_text_sha1, executable: new_executable, .. } => match old_entry { Entry::File { text_sha1: old_text_sha1, executable: old_executable, .. } => { let text_modified = old_text_sha1 != new_text_sha1; let meta_modified = old_executable != new_executable; (text_modified, meta_modified) } _ => panic!("old_entry is not a file"), }, Entry::Directory { .. } | Entry::Root { .. } | Entry::TreeReference { .. 
} => { (false, false) } } } pub fn is_valid_name(name: &str) -> bool { !(name.contains('/') || name == "." || name == "..") } pub fn find_interesting_parents<'a>( inv: &'a MutableInventory, file_ids: &HashSet<&'a FileId>, ) -> HashSet<&'a FileId> { let mut parents: HashSet<&'a FileId> = HashSet::new(); let mut todo = file_ids.iter().cloned().collect::>(); while let Some(file_id) = todo.pop() { let ie = inv.get_entry(file_id).unwrap(); if let Some(parent_id) = ie.parent_id() { if !parents.contains(parent_id) { todo.push(parent_id); parents.insert(parent_id); } } } parents } pub trait Inventory { fn has_filename(&self, filename: &str) -> bool; fn iter_all_ids<'a>(&'a self) -> Box + 'a>; fn id2path(&self, file_id: &FileId) -> Result; fn get_entry(&self, id: &FileId) -> Option<&Entry>; fn has_id(&self, id: &FileId) -> bool; } #[derive(Clone)] pub struct MutableInventory { by_id: HashMap, root_id: Option, pub revision_id: Option, children: HashMap>, } impl Inventory for MutableInventory { fn has_filename(&self, filename: &str) -> bool { self.path2id(filename).is_some() } fn iter_all_ids<'a>(&'a self) -> Box + 'a> { Box::new(self.by_id.keys()) } fn id2path(&self, file_id: &FileId) -> Result { let mut segments = self .iter_file_id_parents(file_id)? .map(|p| p.name()) .collect::>(); segments.pop(); segments.reverse(); Ok(segments.join("/")) } fn get_entry(&self, id: &FileId) -> Option<&Entry> { self.by_id.get(id) } fn has_id(&self, id: &FileId) -> bool { self.by_id.contains_key(id) } } impl MutableInventory { pub fn new() -> MutableInventory { Self { by_id: HashMap::new(), root_id: None, revision_id: None, children: HashMap::new(), } } pub fn get_children(&self, file_id: &FileId) -> Option> { Some( self.children .get(file_id)? 
.iter() .map(|(k, v)| (k.as_str(), self.get_entry(v).expect("child not found"))) .collect(), ) } pub fn change_root_id(&mut self, new_root_id: FileId) { let mut children = self .children .remove(self.root_id.as_ref().unwrap()) .unwrap(); self.by_id.remove(self.root_id.as_ref().unwrap()); self.root_id = Some(new_root_id.clone()); self.by_id.insert( new_root_id.clone(), Entry::Root { file_id: new_root_id.clone(), revision: None, }, ); for (_n, child) in children.iter_mut() { self.by_id .get_mut(child) .unwrap() .set_parent_id(Some(new_root_id.clone())); } self.children.insert(new_root_id, children); } pub fn iter_sorted_children( &self, file_id: &FileId, ) -> Option> { let children = self.get_children(file_id)?; // Sort the children by name and then return them let mut children = children.into_iter().collect::>(); children.sort_by(|(a, _), (b, _)| a.cmp(b)); Some(children.into_iter()) } pub fn entries(&self) -> Vec<(String, &Entry)> { let mut accum = Vec::new(); let mut todo = Vec::new(); if let Some(ref root_id) = self.root_id { todo.push((root_id, "".to_string())); } while !todo.is_empty() { if let Some((dir_id, dir_path)) = todo.pop() { for (name, ie) in self.iter_sorted_children(dir_id).unwrap() { let child_path = if dir_path.is_empty() { name.to_string() } else { format!("{}/{}", dir_path, name) }; accum.push((child_path.clone(), ie)); if ie.kind() == Kind::Directory { todo.push(((ie.file_id()), child_path)); } } } } accum } pub fn rename_id(&mut self, old_file_id: &FileId, new_file_id: &FileId) -> Result<(), Error> { if old_file_id == new_file_id { return Ok(()); } if self.by_id.contains_key(new_file_id) { return Err(Error::DuplicateFileId( new_file_id.clone(), self.id2path(new_file_id).unwrap(), )); } let mut ie = self .by_id .remove(old_file_id) .ok_or_else(|| Error::NoSuchId(old_file_id.clone()))?; if let Some(children) = self.children.remove(old_file_id) { for child_id in children.values() { let child = self.by_id.get_mut(child_id).unwrap(); 
assert_eq!(child.parent_id(), Some(old_file_id)); child.set_parent_id(Some(new_file_id.clone())); } self.children.insert(new_file_id.clone(), children); } ie.set_file_id(new_file_id.clone()); self.by_id.insert(new_file_id.clone(), ie); if self.root_id == Some(old_file_id.clone()) { self.root_id = Some(new_file_id.clone()); } Ok(()) } pub fn path2id(&self, relpath: &str) -> Option<&FileId> { if let Some(ie) = self.get_entry_by_path(relpath) { Some(ie.file_id()) } else { None } } pub fn path2id_segments(&self, names: &[&str]) -> Option<&FileId> { if let Some(ie) = self.get_entry_by_path_segments(names) { Some(ie.file_id()) } else { None } } /// Get an inventory view filtered against a set of file-ids. /// /// Children of directories and parents are included. /// /// The result may or may not reference the underlying inventory /// so it should be treated as immutable. pub fn filter(&self, specific_fileids: &HashSet<&FileId>) -> Result { let mut interesting_parents = HashSet::new(); for file_id in specific_fileids { match self.get_idpath(file_id) { Ok(parents) => { interesting_parents.extend(parents); } Err(Error::NoSuchId(_)) => {} Err(e) => { return Err(e); } } } let mut entries = self.iter_entries(None); let root = entries.next(); let mut other = Self::new(); if root.is_none() { return Ok(other); } other.set_root(root.unwrap().1.clone()); let mut directories_to_expand = HashSet::new(); for (_path, entry) in entries { let file_id = entry.file_id(); if specific_fileids.contains(file_id) || (entry.parent_id().is_some() && directories_to_expand.contains(entry.parent_id().unwrap())) { if entry.kind() == Kind::Directory { directories_to_expand.insert(file_id); } } else if !interesting_parents.contains(file_id) { continue; } other.add(entry.clone()).unwrap(); } Ok(other) } /// Return a list of file_ids for the path to an entry. /// /// The list contains one element for each directory followed by /// the id of the file itself. 
So the length of the returned list /// is equal to the depth of the file in the tree, counting the /// root directory as depth 1. pub fn get_idpath<'a>(&'a self, file_id: &'a FileId) -> Result, Error> { Ok(self .iter_file_id_parents(file_id)? .map(|e| e.file_id()) .collect()) } pub fn get_entry_by_path_partial( &self, relpath: &str, ) -> Option<(&Entry, Vec, Vec)> { let names = osutils::path::splitpath(relpath).unwrap(); self.get_entry_by_path_segments_partial(&names) } pub fn get_entry_by_path_segments_partial( &self, names: &[&str], ) -> Option<(&Entry, Vec, Vec)> { self.root_id.as_ref()?; let mut parent = self.by_id.get(self.root_id.as_ref().unwrap()).unwrap(); for (i, f) in names.iter().enumerate() { if let Some(cie) = self.get_child(parent.file_id(), f) { parent = cie; if cie.kind() == Kind::TreeReference { let (before, after) = names.split_at(i + 1); return Some(( cie, before.iter().map(|s| s.to_string()).collect(), after.iter().map(|s| s.to_string()).collect(), )); } } else { return None; } } Some(( parent, names.iter().map(|s| s.to_string()).collect(), Vec::new(), )) } pub fn get_entry_by_path(&self, relpath: &str) -> Option<&Entry> { self.get_entry_by_path_segments( osutils::path::splitpath(relpath).unwrap().as_slice(), ) } pub fn get_entry_by_path_segments(&self, names: &[&str]) -> Option<&Entry> { self.root_id.as_ref()?; let mut parent = self.by_id.get(self.root_id.as_ref().unwrap()).unwrap(); for f in names { if let Some(cie) = self.get_child(parent.file_id(), f) { parent = cie; } else { return None; } } Some(parent) } /// Return (path, entry) pairs, in order by name. 
    ///
    /// # Arguments
    /// * `from_dir`: if `None`, start from the root; otherwise start from the
    ///   directory with this file id.
    pub fn iter_entries<'a>(
        &'a self,
        from_dir: Option<&FileId>,
    ) -> impl Iterator<Item = (String, &'a Entry)> {
        let mut stack = VecDeque::new();
        let mut from_dir = if from_dir.is_none() {
            self.root_id.clone()
        } else {
            from_dir.cloned()
        };
        if let Some(from_dir) = from_dir.as_ref() {
            let children = self
                .iter_sorted_children(from_dir)
                .unwrap()
                .collect::<VecDeque<_>>();
            stack.push_back((String::new(), children));
        }
        std::iter::from_fn(move || -> Option<(String, &Entry)> {
            // The starting directory itself is yielded first, with an empty path.
            if let Some(from_dir) = from_dir.take() {
                let entry = self.by_id.get(&from_dir)?;
                return Some((String::new(), entry));
            }
            loop {
                if let Some((base, children)) = stack.back_mut() {
                    if let Some((name, ie)) = children.pop_front() {
                        let path = if base.is_empty() {
                            name.to_string()
                        } else {
                            format!("{}/{}", base, name)
                        };
                        if ie.kind() == Kind::Directory {
                            let children = self
                                .iter_sorted_children(ie.file_id())
                                .unwrap()
                                .collect::<VecDeque<_>>();
                            stack.push_back((path.clone(), children));
                        }
                        return Some((path, ie));
                    } else {
                        stack.pop_back();
                    }
                } else {
                    return None;
                }
            }
        })
    }

    /// Iterate over the entries in directory-first order.
    ///
    /// This returns all entries for a directory before returning
    /// the entries for children of a directory. This is not
    /// lexicographically sorted order, and is a hybrid between
    /// depth-first and breadth-first.
/// /// This yields (path, entry) pairs pub fn iter_entries_by_dir<'a>( &'a self, from_dir: Option<&'a FileId>, specific_file_ids: Option<&'a HashSet<&FileId>>, ) -> impl Iterator + 'a { let parents = specific_file_ids .map(|specific_file_ids| find_interesting_parents(self, specific_file_ids)); let mut stack: Vec<(String, &FileId)> = vec![]; let from_dir = if from_dir.is_none() { self.root_id.as_ref() } else { from_dir }; let mut children = VecDeque::new(); if let Some(from_dir) = from_dir { stack.push(("".to_string(), from_dir)); children.extend( self.iter_sorted_children(from_dir) .unwrap() .map(|(p, ie)| (p.to_string(), ie)), ); } std::iter::from_fn(move || -> Option<(String, &'a Entry)> { loop { if let Some(e) = children.pop_front() { return Some(e); } if let Some((cur_relpath, cur_dir)) = stack.pop() { let mut child_dirs = Vec::new(); for (child_name, child_ie) in self.iter_sorted_children(cur_dir).unwrap() { let child_relpath = cur_relpath.to_string() + child_name; if specific_file_ids.is_none() || specific_file_ids.unwrap().contains(child_ie.file_id()) { children.push_back((child_relpath.clone(), child_ie)); } if child_ie.kind() == Kind::Directory && (parents.is_none() || parents.as_ref().unwrap().contains(child_ie.file_id())) { child_dirs.push((child_relpath + "/", child_ie.file_id())) } } stack.extend(child_dirs.into_iter().rev()); } else { return None; } } }) } /// Apply a delta to this inventory. /// /// See the inventory developers documentation for the theory behind /// inventory deltas. /// /// If delta application fails the inventory is left in an indeterminate /// state and must not be used. /// /// # Arguments /// * `delta`: A list of changes to apply. 
    ///   After all the changes are applied the final inventory must be
    ///   internally consistent, but it is ok to supply changes which, if only
    ///   half-applied, would have an invalid result, such as two changes that
    ///   swap the names of two files 'A' and 'B':
    ///   [('A', 'B', b'A-id', a_entry), ('B', 'A', b'B-id', b_entry)].
    ///
    /// Each change is an `InventoryDeltaEntry` with the fields `old_path`,
    /// `new_path`, `file_id` and `new_entry`.
    ///
    /// When `new_path` is `None`, the change indicates the removal of an entry
    /// from the inventory and `new_entry` must be `None` as well. If `new_path`
    /// is not `None`, then `new_entry` must be `Some(entry)`; that entry is
    /// incorporated into the inventory (replacing any existing entry with the
    /// same file id).
    ///
    /// When `old_path` is `None`, the change indicates the addition of
    /// a new entry to the inventory.
    ///
    /// When neither `new_path` nor `old_path` is `None`, the change is a
    /// modification to an entry, such as a rename, reparent, kind change
    /// etc.
    ///
    /// Children are not taken from `new_entry`: this method preserves children
    /// automatically across alterations to the parent of the children, and
    /// cases where the parent id of a child is changing require the child to
    /// be passed in as a separate change regardless. E.g. in the recursive
    /// deletion of a directory, the directory's children must be included in
    /// the delta, or the final inventory will be invalid.
    ///
    /// Note that a file id must appear only once within a given delta;
    /// otherwise `InventoryDeltaInconsistency::DuplicateFileId` is returned.
    pub fn apply_delta(
        &mut self,
        delta: &InventoryDelta,
    ) -> std::result::Result<(), InventoryDeltaInconsistency> {
        // Check that the delta is legal. It would be nice if this could be
        // done within the loops below but it's safer to validate the delta
        // before starting to mutate the inventory, as there isn't a rollback
        // facility.
        delta.check()?;

        let mut children = HashMap::new();

        // Remove all affected items which were in the original inventory,
        // starting with the longest paths, thus ensuring parents are examined
        // after their children, which means that everything we examine has no
        // modified children remaining by the time we examine it.
        let mut old = delta
            .iter()
            .filter_map(|d| {
                d.old_path
                    .as_ref()
                    .map(|old_path| (old_path, d.file_id.clone()))
            })
            .collect::<Vec<_>>();
        old.sort();
        old.reverse();
        for (old_path, file_id) in old {
            if &self.id2path(&file_id).unwrap() != old_path {
                return Err(InventoryDeltaInconsistency::PathMismatch(
                    file_id.clone(),
                    old_path.clone(),
                    self.id2path(&file_id).unwrap(),
                ));
            }

            // Remove file_id and the unaltered children. If file_id is not
            // being deleted it will be reinserted later.
            let ie = self.by_id.remove(&file_id).unwrap();
            if let Some(parent_id) = ie.parent_id() {
                self.children.get_mut(parent_id).unwrap().remove(ie.name());
            }

            // Preserve unaltered children of file_id for later reinsertion.
            if let Some(file_id_children) = self.children.remove(&file_id) {
                if !file_id_children.is_empty() {
                    children.insert(file_id, file_id_children);
                }
            }
        }

        // Insert all affected which should be in the new inventory, reattaching
        // their children if they had any. This is done from shortest path to
        // longest, ensuring that items which were modified and whose parents in
        // the resulting inventory were also modified, are inserted after their
        // parents.
let mut new = delta .iter() .filter_map(|de| { de.new_path .as_ref() .map(|new_path| (new_path, &de.file_id, &de.new_entry)) }) .collect::>(); new.sort(); for (new_path, _fid, new_entry) in new { let new_entry = new_entry.as_ref().unwrap(); self.add(new_entry.clone()).map_err(|e| match e { Error::DuplicateFileId(fid, _path) => { InventoryDeltaInconsistency::DuplicateFileId(new_path.clone(), fid) } Error::ParentNotDirectory(_path, fid) => { InventoryDeltaInconsistency::ParentNotDirectory(new_path.clone(), fid) } Error::NoSuchId(fid) => InventoryDeltaInconsistency::NoSuchId(fid), Error::InvalidEntryName(name) => { InventoryDeltaInconsistency::InvalidEntryName(name) } Error::FileIdCycle(fid, path, parent) => { InventoryDeltaInconsistency::FileIdCycle(fid, path, parent) } Error::ParentMissing(fid) => InventoryDeltaInconsistency::ParentMissing(fid), Error::PathAlreadyVersioned(new_name, parent_path) => { InventoryDeltaInconsistency::PathAlreadyVersioned(new_name, parent_path) } Error::ParentNotVersioned(_parent_path) => { unreachable!(); } Error::InvalidNormalization(_path, _msg) => unreachable!(), })?; if &self.id2path(new_entry.file_id()).unwrap() != new_path { return Err(InventoryDeltaInconsistency::PathMismatch( new_entry.file_id().clone(), new_path.clone(), self.id2path(new_entry.file_id()).unwrap(), )); } if let Some(children) = children.remove(new_entry.file_id()) { self.children.insert(new_entry.file_id().clone(), children); } } if !children.is_empty() { // Get the parent id that was deleted let (parent_id, _children) = children.drain().next().unwrap(); return Err(InventoryDeltaInconsistency::OrphanedChild(parent_id)); } Ok(()) } pub fn create_by_apply_delta( &self, inventory_delta: &InventoryDelta, new_revision_id: RevisionId, ) -> Result { let mut new_inv = self.clone(); new_inv.apply_delta(inventory_delta)?; new_inv.revision_id = Some(new_revision_id); Ok(new_inv) } fn clear(&mut self) { self.root_id = None; self.by_id = HashMap::new(); self.children = 
HashMap::new(); } fn set_root(&mut self, mut ie: Entry) { ie.set_parent_id(None); self.clear(); self.root_id = Some(ie.file_id().clone()); self.by_id.insert(ie.file_id().clone(), ie.clone()); self.children .insert(self.root_id.clone().unwrap(), HashMap::new()); } pub fn len(&self) -> usize { self.by_id.len() } pub fn is_empty(&self) -> bool { self.by_id.is_empty() } pub fn get_file_kind(&self, id: &FileId) -> Option { self.by_id.get(id).map(|e| e.kind()) } /// Returns the entries leading up to the given file_id, including the entry fn iter_file_id_parents<'a>( &'a self, id: &'a FileId, ) -> Result + 'a, Error> { let mut entry: Option<&'a Entry> = self.by_id.get(id); if entry.is_none() { return Err(Error::NoSuchId(id.clone())); } Ok(std::iter::from_fn(move || { if let Some(e) = entry { if let Some(parent_id) = e.parent_id() { entry = Some(self.by_id.get(parent_id).unwrap()); } else { entry = None; } Some(e) } else { None } })) } pub fn root(&self) -> Option<&Entry> { self.get_entry(self.root_id.as_ref()?) } pub fn is_root(&self, id: FileId) -> bool { self.root_id == Some(id) } /// Iterate over all entries. /// /// Unlike iter_entries(), just the entries are returned (not (path, ie)) /// and the order of entries is undefined. pub fn iter_just_entries(&self) -> impl Iterator + '_ { self.by_id.values() } pub fn get_child(&self, parent_id: &FileId, filename: &str) -> Option<&Entry> { if let Some(siblings) = self.children.get(parent_id) { if let Some(child_id) = siblings.get(filename) { self.by_id.get(child_id) } else { None } } else { None } } pub fn add(&mut self, ie: Entry) -> Result<(), Error> { if self.by_id.contains_key(ie.file_id()) { return Err(Error::DuplicateFileId( ie.file_id().clone(), self.id2path(ie.file_id()).unwrap(), )); } if let Some(parent_id) = ie.parent_id() { let parent = self .by_id .get(parent_id) .ok_or_else(|| Error::ParentMissing(parent_id.clone()))?; match parent { Entry::Directory { .. } | Entry::Root { .. 
} => {} _ => { return Err(Error::ParentNotDirectory( self.id2path(parent_id).unwrap(), ie.file_id().clone(), )); } } let siblings = self.children.get_mut(parent.file_id()).unwrap(); match siblings.entry(ie.name().to_string()) { std::collections::hash_map::Entry::Vacant(entry) => { entry.insert(ie.file_id().clone()); } std::collections::hash_map::Entry::Occupied(entry) => { let fid = entry.get().clone(); return Err(Error::PathAlreadyVersioned( self.id2path(&fid).unwrap(), self.id2path(parent.file_id()).unwrap(), )); } } } else { assert!(matches!(ie, Entry::Root { .. })); self.root_id = Some(ie.file_id().clone()); } match ie { Entry::Directory { ref file_id, .. } | Entry::Root { ref file_id, .. } => { self.children.insert(file_id.clone(), HashMap::new()); } _ => {} } self.by_id.insert(ie.file_id().clone(), ie); Ok(()) } pub fn add_path( &mut self, relpath: &str, kind: Kind, file_id: Option, revision: Option, text_sha1: Option>, text_size: Option, executable: Option, text_id: Option>, symlink_target: Option, reference_revision: Option, ) -> Result { let parts = osutils::path::splitpath(relpath).unwrap(); if parts.is_empty() { self.clear(); let file_id = Some(file_id.unwrap_or_else(FileId::generate_root_id)); let root = Entry::root(file_id.as_ref().unwrap().clone(), revision); self.add(root)?; Ok(self.root_id.as_ref().unwrap().clone()) } else { let (basename, parent_path) = parts.split_last().unwrap(); let parent_id = self.path2id_segments(parent_path); if parent_id.is_none() { return Err(Error::ParentNotVersioned(parent_path.join("/"))); } let ie = make_entry( kind, basename.to_string(), parent_id.cloned(), file_id, revision, text_sha1, text_size, executable, text_id, symlink_target, reference_revision, )?; let file_id = ie.file_id().clone(); self.add(ie)?; Ok(file_id) } } pub fn delete(&mut self, file_id: &FileId) -> Result<(), Error> { let ie = self .by_id .remove(file_id) .ok_or_else(|| Error::NoSuchId(file_id.clone()))?; if let Some(parent_id) = ie.parent_id() { 
let siblings = self.children.get_mut(parent_id).unwrap(); siblings.remove(ie.name()); } else { assert_eq!(file_id, self.root_id.as_ref().unwrap()); self.root_id = None; } Ok(()) } pub fn make_delta(&self, old: &dyn Inventory) -> InventoryDelta { let old_ids = old.iter_all_ids().collect::>(); let new_ids = self.iter_all_ids().collect::>(); let adds = new_ids.difference(&old_ids).collect::>(); let deletes = old_ids.difference(&new_ids).collect::>(); let common = if adds.is_empty() && deletes.is_empty() { new_ids.clone() } else { old_ids .intersection(&new_ids) .cloned() .collect::>() }; let mut delta = Vec::new(); for file_id in deletes { delta.push(InventoryDeltaEntry { old_path: Some(old.id2path(file_id).unwrap()), new_path: None, file_id: (*file_id).clone(), new_entry: None, }); } for file_id in adds { delta.push(InventoryDeltaEntry { old_path: None, new_path: Some(self.id2path(file_id).unwrap()), file_id: (*file_id).clone(), new_entry: self.get_entry(file_id).cloned(), }); } for file_id in common { let new_ie = self.get_entry(file_id); let old_ie = old.get_entry(file_id); // If xml_serializer returns the cached InventoryEntries (rather // than always doing .copy()), inlining the 'is' check saves 2.7M // calls to __eq__. Under lsprof this saves 20s => 6s. // It is a minor improvement without lsprof. 
if old_ie == new_ie { continue; } delta.push(InventoryDeltaEntry { old_path: Some(old.id2path(file_id).unwrap()), new_path: Some(self.id2path(file_id).unwrap()), file_id: file_id.clone(), new_entry: new_ie.cloned(), }); } InventoryDelta(delta) } pub fn remove_recursive_id(&mut self, file_id: &FileId) -> Vec { let start_ie = self.by_id.get(file_id).unwrap().clone(); let mut to_find_delete = vec![start_ie]; let mut to_delete = Vec::new(); while let Some(ie) = to_find_delete.pop() { if ie.kind() == Kind::Directory { to_find_delete.extend( self.get_children(ie.file_id()) .unwrap() .values() .cloned() .cloned(), ); } to_delete.push(ie); } let mut deleted = Vec::new(); to_delete.reverse(); for ie in to_delete { deleted.push(self.by_id.remove(ie.file_id()).unwrap()); if ie.kind() == Kind::Directory { let children = self.children.remove(ie.file_id()).unwrap(); assert!(children.is_empty()); } else { assert!(!self.children.contains_key(ie.file_id())); } if let Some(parent_id) = ie.parent_id() { let siblings = self.children.get_mut(parent_id).unwrap(); siblings.remove(ie.name()); } else { self.root_id = None; } } deleted.reverse(); deleted } pub fn rename( &mut self, file_id: &FileId, new_parent_id: &FileId, new_name: &str, ) -> Result<(), Error> { let new_name = std::path::PathBuf::from(new_name); let new_name = ensure_normalized_name(new_name.as_path())?; let new_name = new_name.to_str().unwrap(); if !is_valid_name(new_name) { return Err(Error::InvalidEntryName(new_name.to_string())); } let new_siblings = self.children.get_mut(new_parent_id).unwrap(); if new_siblings.contains_key(new_name) { return Err(Error::PathAlreadyVersioned( new_name.to_string(), self.id2path(new_parent_id).unwrap(), )); } let new_parent_idpath = self.get_idpath(new_parent_id).unwrap(); if new_parent_idpath.contains(&file_id) { return Err(Error::FileIdCycle( file_id.clone(), self.id2path(file_id).unwrap(), self.id2path(new_parent_id).unwrap(), )); } let file_ie = self.by_id.get(file_id).unwrap(); let 
        old_parent = self.by_id.get(file_ie.parent_id().unwrap()).unwrap();
        let new_parent = self.by_id.get(new_parent_id).unwrap();

        // TODO: Don't leave things messed up if this fails
        self.children
            .get_mut(old_parent.file_id())
            .unwrap()
            .remove(file_ie.name());
        self.children
            .get_mut(new_parent.file_id())
            .unwrap()
            .insert(new_name.to_string(), file_id.clone());
        let file_ie = self.by_id.get_mut(file_id).unwrap();
        file_ie.set_name(new_name.to_string());
        file_ie.set_parent_id(Some(new_parent_id.clone()));
        Ok(())
    }
}

impl Default for MutableInventory {
    fn default() -> Self {
        Self::new()
    }
}

impl std::fmt::Debug for MutableInventory {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        const MAX_LEN: usize = 2048;
        const CLOSING: &str = "...}";
        let mut contents = format!("{:?}", self.by_id);
        if contents.len() > MAX_LEN {
            contents = contents[0..MAX_LEN - CLOSING.len()].to_string() + CLOSING;
        }
        write!(
            f,
            "",
            self,
            self.by_id.len(),
            contents,
        )
    }
}

impl PartialEq for MutableInventory {
    fn eq(&self, other: &Self) -> bool {
        self.by_id == other.by_id
    }
}

impl Eq for MutableInventory {}

// Normalize name
pub fn ensure_normalized_name(name: &std::path::Path) -> Result {
    let (norm_name, can_access) = osutils::path::normalized_filename(name).ok_or_else(|| {
        Error::InvalidNormalization(name.to_path_buf(), "name is not normalized".to_string())
    })?;

    if norm_name != name {
        if can_access {
            return Ok(norm_name);
        } else {
            // Interpolate the offending name; the message previously carried
            // a literal, unformatted `{}` placeholder.
            return Err(Error::InvalidNormalization(
                name.to_path_buf(),
                format!(
                    "name '{}' is not normalized and cannot be accessed",
                    name.display()
                ),
            ));
        }
    }

    Ok(name.to_path_buf())
}

pub fn make_entry(
    kind: Kind,
    name: String,
    parent_id: Option,
    file_id: Option,
    revision: Option,
    text_sha1: Option>,
    text_size: Option,
    executable: Option,
    text_id: Option>,
    symlink_target: Option,
    reference_revision: Option,
) -> Result {
    let file_id = file_id.unwrap_or_else(|| FileId::generate(name.as_str()));
    if !is_valid_name(&name) {
        panic!("Invalid name: {}", name);
    }
    let name =
ensure_normalized_name(std::path::Path::new(&name))? .to_str() .unwrap() .to_string(); Ok(match kind { Kind::File => Entry::file( file_id, name, parent_id.unwrap(), revision, text_sha1, text_size, executable, text_id, ), Kind::Directory => { if let Some(parent_id) = parent_id { Entry::directory(file_id, name, parent_id, revision) } else { Entry::root(file_id, revision) } } Kind::Symlink => Entry::link(file_id, name, parent_id.unwrap(), revision, symlink_target), Kind::TreeReference => Entry::tree_reference( file_id, name, parent_id.unwrap(), revision, reference_revision, ), }) } bzrformats_3.4.0.orig/crates/bazaar/src/inventory_delta.rs0000644000000000000000000005020415162074037020750 0ustar00//! Inventory delta serialisation. //! //! See doc/developers/inventory.txt for the description of the format. //! //! In this module the interesting classes are: //! - InventoryDeltaSerializer - object to read/write inventory deltas. use crate::inventory::Entry; use crate::{FileId, RevisionId, NULL_REVISION}; use std::collections::HashSet; use std::iter::FromIterator; #[derive(Debug, PartialEq, Eq, Clone)] pub struct InventoryDeltaEntry { pub old_path: Option, pub new_path: Option, pub file_id: FileId, pub new_entry: Option, } #[derive(Debug, PartialEq, Eq, Clone)] pub struct InventoryDelta(pub Vec); impl FromIterator for InventoryDelta { fn from_iter>(iter: T) -> Self { InventoryDelta(iter.into_iter().collect()) } } impl From> for InventoryDelta { fn from(v: Vec) -> Self { InventoryDelta(v) } } impl std::ops::Deref for InventoryDelta { type Target = Vec; fn deref(&self) -> &Self::Target { &self.0 } } impl std::ops::DerefMut for InventoryDelta { fn deref_mut(&mut self) -> &mut Self::Target { &mut self.0 } } pub enum InventoryDeltaInconsistency { DuplicateFileId(String, FileId), DuplicateOldPath(String, FileId), DuplicateNewPath(String, FileId), NoPath, MismatchedId(String, FileId, FileId), EntryWithoutPath(String, FileId), PathWithoutEntry(String, FileId), 
PathMismatch(FileId, String, String), OrphanedChild(FileId), ParentNotDirectory(String, FileId), ParentMissing(FileId), NoSuchId(FileId), InvalidEntryName(String), FileIdCycle(FileId, String, String), PathAlreadyVersioned(String, String), } impl InventoryDelta { pub fn check(&self) -> Result<(), InventoryDeltaInconsistency> { let mut ids = HashSet::new(); let mut old_paths = HashSet::new(); let mut new_paths = HashSet::new(); for entry in self.iter() { let path = if let Some(old_path) = &entry.old_path { old_path } else if let Some(new_path) = &entry.new_path { new_path } else { return Err(InventoryDeltaInconsistency::NoPath); }; if !ids.insert(&entry.file_id) { return Err(InventoryDeltaInconsistency::DuplicateFileId( path.clone(), entry.file_id.clone(), )); } if entry.old_path.is_some() { let old_path = entry.old_path.as_ref().unwrap(); if !old_paths.insert(old_path) { return Err(InventoryDeltaInconsistency::DuplicateOldPath( old_path.clone(), entry.file_id.clone(), )); } } if entry.new_path.is_some() { let new_path = entry.new_path.as_ref().unwrap(); if !new_paths.insert(new_path) { return Err(InventoryDeltaInconsistency::DuplicateNewPath( new_path.clone(), entry.file_id.clone(), )); } } if let Some(ref new_entry) = entry.new_entry { if &entry.file_id != new_entry.file_id() { return Err(InventoryDeltaInconsistency::MismatchedId( path.clone(), entry.file_id.clone(), new_entry.file_id().clone(), )); } } if entry.new_entry.is_some() && entry.new_path.is_none() { return Err(InventoryDeltaInconsistency::EntryWithoutPath( path.clone(), entry.file_id.clone(), )); } if entry.new_entry.is_none() && entry.new_path.is_some() { return Err(InventoryDeltaInconsistency::PathWithoutEntry( path.clone(), entry.file_id.clone(), )); } } Ok(()) } pub fn sort(&mut self) { fn key(entry: &InventoryDeltaEntry) -> (&str, &str, &FileId, Option<&Entry>) { ( entry.old_path.as_deref().unwrap_or(""), entry.new_path.as_deref().unwrap_or(""), &entry.file_id, entry.new_entry.as_ref(), ) } 
self.sort_by(|x, y| key(y).cmp(&key(x))); } } #[derive(Debug)] pub enum InventoryDeltaSerializeError { Invalid(String), UnsupportedKind(String), } const FORMAT_1: &str = "bzr inventory delta v1 (bzr 1.14)"; pub fn serialize_inventory_entry(e: &Entry) -> Result, InventoryDeltaSerializeError> { Ok(match e { Entry::Directory { .. } | Entry::Root { .. } => b"dir".to_vec(), Entry::File { executable, text_size, ref text_sha1, .. } => { let mut v = b"file".to_vec(); v.push(b'\x00'); if text_size.is_none() { return Err(InventoryDeltaSerializeError::Invalid( "text_size is None".to_string(), )); } v.extend_from_slice(text_size.unwrap().to_string().as_bytes()); v.push(b'\x00'); if *executable { v.push(b'Y'); } v.push(b'\x00'); let text_sha1 = text_sha1.as_ref(); if text_sha1.is_none() { return Err(InventoryDeltaSerializeError::Invalid( "text_sha1 is None".to_string(), )); } v.extend_from_slice(text_sha1.unwrap().as_slice()); v } Entry::Link { symlink_target, .. } => { let mut v = b"link".to_vec(); v.push(b'\x00'); if symlink_target.is_none() { return Err(InventoryDeltaSerializeError::Invalid( "symlink_target is None".to_string(), )); } v.extend_from_slice(symlink_target.as_ref().unwrap().as_bytes()); v } Entry::TreeReference { reference_revision, .. 
        } => {
            let mut v = b"tree".to_vec();
            v.push(b'\x00');
            if reference_revision.is_none() {
                return Err(InventoryDeltaSerializeError::Invalid(
                    "reference_revision is None".to_string(),
                ));
            }
            v.extend_from_slice(reference_revision.as_ref().unwrap().as_bytes());
            v
        }
    })
}

/// Return a line sequence for `delta_to_new`.
///
/// # Arguments
/// * `old_name`: A UTF8 revision id for the old inventory. May be
///   NULL_REVISION if there is no older inventory and `delta_to_new`
///   includes the entire inventory contents.
/// * `new_name`: The version name of the inventory we create with this
///   delta.
/// * `delta_to_new`: An inventory delta such as `apply_delta` takes.
/// * `versioned_root`: Written to the header and passed on to the
///   serialization of each delta item.
/// * `tree_references`: Written to the header; when false, a tree-reference
///   entry in the delta is rejected with `UnsupportedKind`.
///
/// Returns the serialized delta as lines: five header lines followed by the
/// sorted item lines.
pub fn serialize_inventory_delta(
    old_name: &RevisionId,
    new_name: &RevisionId,
    delta_to_new: &InventoryDelta,
    versioned_root: bool,
    tree_references: bool,
) -> Result<Vec<Vec<u8>>, InventoryDeltaSerializeError> {
    let mut lines = vec![
        format!("format: {}\n", FORMAT_1).into_bytes(),
        [&b"parent: "[..], old_name.as_bytes(), &b"\n"[..]].concat(),
        [&b"version: "[..], new_name.as_bytes(), &b"\n"[..]].concat(),
        format!("versioned_root: {}\n", serialize_bool(versioned_root)).into_bytes(),
        format!("tree_references: {}\n", serialize_bool(tree_references)).into_bytes(),
    ];

    let mut extra_lines = delta_to_new
        .iter()
        .map(|entry| {
            if let Some(entry) = entry.new_entry.as_ref() {
                if !tree_references && entry.kind() == osutils::Kind::TreeReference {
                    return Err(InventoryDeltaSerializeError::UnsupportedKind(
                        "tree-reference".to_string(),
                    ));
                }
            }
            delta_entry_to_line(entry, new_name, Some(versioned_root))
        })
        .collect::<Result<Vec<_>, _>>()?;
    extra_lines.sort();
    lines.extend(extra_lines);
    Ok(lines)
}

/// Serialize a single delta item as one NUL-separated, newline-terminated line.
fn delta_entry_to_line( delta_item: &InventoryDeltaEntry, new_version: &RevisionId, versioned_root: Option, ) -> Result, InventoryDeltaSerializeError> { let versioned_root = versioned_root.unwrap_or(true); let last_modified; let parent_id; let oldpath_utf8; let newpath_utf8; let content; if delta_item.new_path.is_none() { // delete if delta_item.old_path.is_none() { return Err(InventoryDeltaSerializeError::Invalid(format!( "Bad inventory delta: old_path is None in delta item {:?}", delta_item ))); } oldpath_utf8 = format!("/{}", delta_item.old_path.as_ref().unwrap()); newpath_utf8 = "None".to_string(); parent_id = &b""[..]; last_modified = RevisionId::from(NULL_REVISION); content = b"deleted\x00\x00".to_vec(); } else { oldpath_utf8 = if let Some(ref old_path) = delta_item.old_path { format!("/{}", old_path) } else { "None".to_string() }; if delta_item.new_entry.is_none() { return Err(InventoryDeltaSerializeError::Invalid(format!( "Bad inventory delta: new_entry is None in delta item {:?}", delta_item ))); } let new_entry = delta_item.new_entry.as_ref().unwrap(); if delta_item.new_path == Some("/".to_string()) { return Err(InventoryDeltaSerializeError::Invalid(format!( "Bad inventory delta: '/' is not a valid newpath (should be '') in delta item {:?}", delta_item ))); } newpath_utf8 = format!( "/{}", delta_item.new_path.as_ref().unwrap_or(&"".to_string()) ); // Serialize None as '' parent_id = new_entry .parent_id() .as_ref() .map_or(&b""[..], |x| x.as_bytes()); // Serialize unknown revisions as NULL_REVISION if new_entry.revision().is_none() { return Err(InventoryDeltaSerializeError::Invalid(format!( "no version for fileid {:?}", delta_item.file_id ))); } last_modified = new_entry.revision().unwrap().clone(); // special cases for / if newpath_utf8 == "/" && !versioned_root { // This is an entry for the root, this inventory does not // support versioned roots. So this must be an unversioned // root, i.e. last_modified == new revision. 
Otherwise, this // delta is invalid. // Note: the non-rich-root repositories *can* have roots with // file-ids other than TREE_ROOT, e.g. repo formats that use the // xml5 serializer. if &last_modified != new_version { return Err(InventoryDeltaSerializeError::Invalid(format!( "Version present for / in {:?} ({:?} != {:?})", new_entry.file_id(), last_modified, new_version ))); } } content = serialize_inventory_entry(new_entry)?; } let entries = [ oldpath_utf8.as_bytes(), newpath_utf8.as_bytes(), delta_item.file_id.as_bytes(), parent_id, last_modified.as_bytes(), content.as_slice(), ]; let mut line = entries.join(&b"\x00"[..]); line.push(b'\n'); Ok(line) } pub fn parse_inventory_entry( file_id: FileId, name: String, parent_id: Option, revision: Option, data: &[u8], ) -> Entry { let mut parts = data.split(|&c| c == b'\x00'); let entry_type = parts.next().unwrap(); match entry_type { b"dir" => { if parent_id.is_none() { Entry::Root { file_id, revision } } else { Entry::Directory { file_id, name, parent_id: parent_id.unwrap(), revision, } } } b"file" => { let text_size = parts.next().unwrap(); let executable = parts.next().unwrap(); let text_sha1 = parts.next().unwrap(); Entry::File { file_id, name, parent_id: parent_id.unwrap(), executable: executable == b"Y", text_id: None, text_size: Some( String::from_utf8(text_size.to_vec()) .unwrap() .parse() .unwrap(), ), text_sha1: Some(text_sha1.to_vec()), revision, } } b"link" => { let symlink_target = parts.next().unwrap(); Entry::Link { file_id, name, parent_id: parent_id.unwrap(), symlink_target: Some(String::from_utf8(symlink_target.to_vec()).unwrap()), revision, } } b"tree" => { let reference_revision = parts.next().unwrap(); Entry::TreeReference { file_id, name, parent_id: parent_id.unwrap(), reference_revision: Some(RevisionId::from(reference_revision)), revision, } } _ => panic!("Invalid entry type: {:?}", entry_type), } } fn serialize_bool(value: bool) -> &'static str { if value { "true" } else { "false" } } fn 
parse_bool(value: &[u8]) -> Result { match value { b"true" => Ok(true), b"false" => Ok(false), _ => Err(format!("Invalid boolean value: {:?}", value)), } } pub fn parse_inventory_delta_item( line: &[u8], versioned_root: bool, tree_references: bool, delta_version_id: &RevisionId, ) -> Result { let parts = line.splitn(6, |&c| c == b'\x00').collect::>(); let oldpath_utf8 = parts[0]; let newpath_utf8 = parts[1]; let file_id = FileId::from(parts[2]); let parent_id = if parts[3].is_empty() { None } else { Some(FileId::from(parts[3])) }; let last_modified = RevisionId::from(parts[4]); let content = parts[5]; if newpath_utf8 == b"/" && !versioned_root && &last_modified != delta_version_id { return Err(InventoryDeltaParseError::Invalid( "Versioned root found".to_string(), )); } else if newpath_utf8 != b"None" && last_modified.is_reserved() { return Err(InventoryDeltaParseError::Invalid(format!( "special revisionid found: {:?}", last_modified ))); } if content.starts_with(b"tree\x00") && !tree_references { return Err(InventoryDeltaParseError::Invalid( "Tree reference found (but header said tree_references: false)".to_string(), )); } fn parse_path(kind: &str, path: &[u8]) -> Result, InventoryDeltaParseError> { if path == b"None" { Ok(None) } else if !path.starts_with(b"/") { Err(InventoryDeltaParseError::Invalid(format!( "{} invalid: {} (does not start with /)", kind, String::from_utf8_lossy(path) ))) } else { Ok(Some(String::from_utf8(path[1..].to_vec()).map_err( |x| { InventoryDeltaParseError::Invalid(format!( "{} invalid: {} (invalid utf8: {})", kind, String::from_utf8_lossy(path), x )) }, )?)) } } let old_path = parse_path("oldpath", oldpath_utf8)?; let new_path = parse_path("newpath", newpath_utf8)?; let new_entry = if content.starts_with(b"deleted\x00") { None } else { let name = new_path.as_ref().unwrap().rsplit_once('/').map_or_else( || new_path.as_ref().unwrap().clone(), |(_, name)| name.to_string(), ); Some(parse_inventory_entry( file_id.clone(), name, parent_id, 
Some(last_modified), content, )) }; Ok(InventoryDeltaEntry { old_path, new_path, file_id, new_entry, }) } #[derive(Debug)] pub enum InventoryDeltaParseError { Incompatible(String), Invalid(String), } pub fn parse_inventory_delta( lines: &[&[u8]], allow_versioned_root: Option, allow_tree_references: Option, ) -> Result<(RevisionId, RevisionId, bool, bool, InventoryDelta), InventoryDeltaParseError> { let allow_versioned_root = allow_versioned_root.unwrap_or(true); let allow_tree_references = allow_tree_references.unwrap_or(true); if lines.is_empty() { return Err(InventoryDeltaParseError::Invalid( "Invalid inventory delta is empty".to_string(), )); } if !lines[lines.len() - 1].ends_with(b"\n") { return Err(InventoryDeltaParseError::Invalid( "last line not empty".to_string(), )); } let lines = lines .iter() .map(|x| x.strip_suffix(b"\n").unwrap()) .collect::>(); if lines.is_empty() || lines[0] != [&b"format: "[..], FORMAT_1.as_bytes()].concat() { return Err(InventoryDeltaParseError::Invalid(format!( "unknown format: {}", String::from_utf8_lossy(&lines[0][8..]) ))); } if lines.len() < 2 || !lines[1].starts_with(b"parent: ") { return Err(InventoryDeltaParseError::Invalid( "missing parent: marker".to_string(), )); } let delta_parent_id = RevisionId::from(lines[1][8..].to_vec()); if lines.len() < 3 || !lines[2].starts_with(b"version: ") { return Err(InventoryDeltaParseError::Invalid( "missing version: marker".to_string(), )); } let delta_version = RevisionId::from(lines[2][9..].to_vec()); if lines.len() < 4 || !lines[3].starts_with(b"versioned_root: ") { return Err(InventoryDeltaParseError::Invalid( "missing versioned_root: marker".to_string(), )); } let delta_versioned_root = parse_bool(&lines[3][16..]).unwrap(); if !allow_versioned_root && delta_versioned_root { return Err(InventoryDeltaParseError::Incompatible( "versioned_root not allowed".to_string(), )); } if lines.len() < 5 || !lines[4].starts_with(b"tree_references: ") { return Err(InventoryDeltaParseError::Invalid( 
"missing tree_references: marker".to_string(), )); } let delta_tree_references = parse_bool(&lines[4][17..]).unwrap(); let mut result = Vec::new(); let mut ids = HashSet::new(); for line in lines.iter().skip(5) { let item = parse_inventory_delta_item( line, delta_versioned_root, delta_tree_references, &delta_version, )?; if !allow_tree_references && item.new_entry.is_some() && item.new_entry.as_ref().unwrap().kind() == osutils::Kind::TreeReference { return Err(InventoryDeltaParseError::Incompatible( "Tree reference not allowed".to_string(), )); } if !ids.insert(item.file_id.clone()) { return Err(InventoryDeltaParseError::Invalid(format!( "duplicate file id: {:?}", item.file_id ))); } result.push(item); } Ok(( delta_parent_id, delta_version, delta_versioned_root, delta_tree_references, InventoryDelta(result), )) } bzrformats_3.4.0.orig/crates/bazaar/src/lib.rs0000644000000000000000000001264415162074037016316 0ustar00#[cfg(feature = "pyo3")] use pyo3::{prelude::*, types::PyBytes}; use std::fmt::{Debug, Error, Formatter}; pub const DEFAULT_CHUNK_SIZE: usize = 4096; pub mod bencode_serializer; pub mod chk_inventory; pub mod chk_map; pub mod dirstate; pub mod filters; pub mod gen_ids; pub mod globbing; pub mod groupcompress; pub mod hashcache; pub mod inventory; pub mod inventory_delta; pub mod repository; pub mod revision; pub mod rio; pub mod serializer; pub mod smart; pub mod versionedfile; pub mod xml_serializer; #[cfg(feature = "pyo3")] pub mod pyversionedfile; #[derive(Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] pub struct FileId(Vec); impl Debug for FileId { fn fmt(&self, f: &mut Formatter) -> Result<(), Error> { write!(f, "{}", String::from_utf8(self.0.clone()).unwrap()) } } impl From> for FileId { fn from(v: Vec) -> Self { check_valid(&v); FileId(v) } } impl From for Vec { fn from(v: FileId) -> Self { v.0 } } impl From<&[u8]> for FileId { fn from(v: &[u8]) -> Self { check_valid(v); FileId(v.to_vec()) } } impl From<&Vec> for FileId { fn from(v: &Vec) -> Self { 
FileId::from(v.as_slice()) } } impl FileId { pub fn generate(name: &str) -> Self { Self::from(gen_ids::gen_file_id(name)) } pub fn generate_root_id() -> Self { Self::from(gen_ids::gen_root_id()) } pub fn as_bytes(&self) -> &[u8] { &self.0 } } #[cfg(feature = "pyo3")] impl<'a, 'py> FromPyObject<'a, 'py> for FileId { type Error = PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, PyAny>) -> PyResult { let s: Vec = ob.extract()?; Ok(FileId::from(s)) } } #[cfg(feature = "pyo3")] impl<'py> IntoPyObject<'py> for &FileId { type Target = pyo3::types::PyBytes; type Output = Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: Python<'py>) -> Result { Ok(PyBytes::new(py, &self.0)) } } #[cfg(feature = "pyo3")] impl<'py> IntoPyObject<'py> for FileId { type Target = pyo3::types::PyBytes; type Output = Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: Python<'py>) -> Result { (&self).into_pyobject(py) } } impl std::fmt::Display for FileId { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { write!(f, "{}", String::from_utf8(self.0.clone()).unwrap()) } } #[derive(Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] pub struct RevisionId(Vec); impl Debug for RevisionId { fn fmt(&self, f: &mut Formatter) -> Result<(), Error> { write!(f, "{}", String::from_utf8(self.0.clone()).unwrap()) } } impl std::fmt::Display for RevisionId { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { write!(f, "{}", String::from_utf8(self.0.clone()).unwrap()) } } impl From> for RevisionId { fn from(v: Vec) -> Self { check_valid(&v); RevisionId(v) } } impl From<&[u8]> for RevisionId { fn from(v: &[u8]) -> Self { check_valid(v); RevisionId(v.to_vec()) } } impl From for Vec { fn from(v: RevisionId) -> Self { v.0 } } #[cfg(feature = "pyo3")] impl<'a, 'py> FromPyObject<'a, 'py> for RevisionId { type Error = PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, PyAny>) -> PyResult { let s: Vec = ob.extract()?; if !is_valid(&s) { return 
Err(pyo3::exceptions::PyValueError::new_err(format!( "Invalid revision id: {:?}", s ))); } Ok(RevisionId::from(s)) } } #[cfg(feature = "pyo3")] impl<'py> IntoPyObject<'py> for &RevisionId { type Target = pyo3::types::PyBytes; type Output = Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: Python<'py>) -> Result { let obj = PyBytes::new(py, &self.0); Ok(obj) } } #[cfg(feature = "pyo3")] impl<'py> IntoPyObject<'py> for RevisionId { type Target = pyo3::types::PyBytes; type Output = Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: Python<'py>) -> Result { (&self).into_pyobject(py) } } pub const NULL_REVISION: &[u8] = b"null:"; pub const CURRENT_REVISION: &[u8] = b"current:"; pub fn is_valid(id: &[u8]) -> bool { if id.contains(&b' ') || id.contains(&b'\t') || id.contains(&b'\n') || id.contains(&b'\r') { return false; } if id.is_empty() { return false; } true } pub fn check_valid(id: &[u8]) { if !is_valid(id) { if let Ok(id) = String::from_utf8(id.to_vec()) { panic!("Invalid id: {:?}", id); } else { panic!("Invalid id: {:?}", id); } } } impl RevisionId { pub fn is_null(&self) -> bool { self.0 == NULL_REVISION } pub fn generate(username: &str, timestamp: Option) -> Self { Self::from(gen_ids::gen_revision_id(username, timestamp)) } pub fn as_bytes(&self) -> &[u8] { &self.0 } pub fn is_reserved(&self) -> bool { self.0.ends_with(b":") } pub fn expect_not_reserved(&self) { if self.is_reserved() { panic!("Expected non-reserved revision id, got {:?}", self); } } } bzrformats_3.4.0.orig/crates/bazaar/src/pyversionedfile.rs0000644000000000000000000002011615162074037020750 0ustar00use crate::versionedfile::{ContentFactory, Error, Key, Ordering, VersionId, VersionedFile}; use pyo3::prelude::*; use pyo3::types::PyBytes; use std::borrow::Cow; pub struct PyContentFactory(Py); impl<'py> IntoPyObject<'py> for PyContentFactory { type Target = PyAny; type Output = Bound<'py, PyAny>; type Error = PyErr; fn into_pyobject(self, py: 
Python<'py>) -> Result { Ok(self.0.clone_ref(py).into_bound(py)) } } impl<'a, 'py> FromPyObject<'a, 'py> for PyContentFactory { type Error = PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, PyAny>) -> PyResult { Ok(PyContentFactory(ob.to_owned().unbind())) } } impl ContentFactory for PyContentFactory { fn size(&self) -> Option { Python::attach(|py| self.0.getattr(py, "size").unwrap().extract(py).unwrap()) } fn key(&self) -> Key { Python::attach(|py| { let py_key = self.0.getattr(py, "key").unwrap(); py_key.extract(py).unwrap() }) } fn parents(&self) -> Option> { Python::attach(|py| { let py_parents = self.0.getattr(py, "parents").unwrap(); py_parents.extract(py).unwrap() }) } fn to_fulltext<'a, 'b>(&'a self) -> Cow<'b, [u8]> where 'a: 'b, { Cow::Owned(Python::attach(|py| { let py_content = self .0 .call_method1(py, "get_bytes_as", ("fulltext",)) .unwrap(); py_content.extract::>(py).unwrap() })) } fn to_lines<'a, 'b>(&'a self) -> Box>> where 'a: 'b, { let py_content = Python::attach(|py| { self.0 .call_method1(py, "get_bytes_as", ("lines",)) .unwrap() .extract::>>(py) .unwrap() }); Box::new(py_content.into_iter().map(Cow::Owned)) } fn to_chunks<'a, 'b>(&'a self) -> Box>> where 'a: 'b, { let py_content = Python::attach(|py| { self.0 .call_method1(py, "get_bytes_as", ("chunks",)) .unwrap() .extract::>>(py) .unwrap() }); Box::new(py_content.into_iter().map(Cow::Owned)) } fn into_fulltext(self) -> Vec { self.to_fulltext().into_owned() } fn into_lines(self) -> Box>> { let lines = Python::attach(|py| { let py_content = self.0.call_method1(py, "get_bytes", ("lines",)).unwrap(); py_content.extract::>>(py).unwrap() }); Box::new(lines.into_iter().map(|l| l.to_vec())) } fn into_chunks(self) -> Box>> { let chunks = Python::attach(|py| { let py_content = self.0.call_method1(py, "get_bytes", ("chunks",)).unwrap(); py_content.extract::>>(py).unwrap() }); Box::new(chunks.into_iter().map(|c| c.to_vec())) } fn sha1(&self) -> Option> { Python::attach(|py| { let py_content = 
self.0.call_method0(py, "get_bytes").unwrap(); py_content.extract(py).unwrap() }) } fn storage_kind(&self) -> String { Python::attach(|py| { self.0 .getattr(py, "storage_kind") .unwrap() .extract(py) .unwrap() }) } fn map_key(&mut self, _f: &dyn Fn(Key) -> Key) { todo!(); } } pub struct PyVersionedFile(Py); pub struct PyRecordStreamIter(Py); impl Iterator for PyRecordStreamIter { type Item = PyContentFactory; fn next(&mut self) -> Option { Python::attach(|py| { let py_record_stream_iter = self.0.bind(py); let py_content_factory = py_record_stream_iter.call_method0("next").unwrap(); let content_factory = PyContentFactory(py_content_factory.unbind()); Some(content_factory) }) } } impl VersionedFile> for PyVersionedFile { fn check_not_reserved_id(version_id: &VersionId) -> bool { Python::attach(|py| { let m = py.import("bzrformats.versionedfile").unwrap(); let c = m.getattr("VersionedFile").unwrap(); c.call_method1("check_not_reserved_id", (version_id,)) .unwrap() .extract() .unwrap() }) } fn has_version(&self, version_id: &VersionId) -> bool { Python::attach(|py| { let py_versioned_file = self.0.bind(py); py_versioned_file .call_method1("has_version", (version_id,)) .unwrap() .extract() .unwrap() }) } fn get_format_signature(&self) -> String { Python::attach(|py| { self.0 .call_method0(py, "get_format_signature") .unwrap() .extract(py) .unwrap() }) } fn get_record_stream( &self, version_ids: &[&VersionId], ordering: Ordering, include_delta_closure: bool, ) -> Box> { Box::new(Python::attach(|py| { let py_versioned_file = self.0.bind(py); let version_ids = version_ids.iter().collect::>(); let py_record_stream = py_versioned_file .call_method1( "get_record_stream", (version_ids, ordering, include_delta_closure), ) .unwrap(); Box::new(PyRecordStreamIter(py_record_stream.unbind())) })) } fn add_lines<'a>( &mut self, version_id: &VersionId, parent_texts: Option>>, lines: impl Iterator, nostore_sha: Option, random_id: bool, ) -> Result<(Vec, usize, Py), Error> { 
Python::attach(|py| { let py_versioned_file = self.0.bind(py); let py_lines = lines.map(|l| PyBytes::new(py, l)).collect::>(); let py_parent_texts = match parent_texts { Some(parent_texts) => { let py_parent_texts = parent_texts .into_iter() .map(|(k, v)| Ok((k.into_pyobject(py)?, v))) .collect::, PyErr>>()?; Some(py_parent_texts) } None => None, }; let py_result = py_versioned_file.call_method1( "add_lines", ( version_id, py_parent_texts, py_lines, nostore_sha, random_id, ), )?; let py_result = py_result.extract::<(Vec, usize, Py)>()?; Ok(py_result) }) } fn insert_record_stream( &mut self, stream: impl Iterator>, ) -> Result<(), Error> { #[pyclass(unsendable)] struct PyContentFactory(Box); #[pymethods] impl PyContentFactory { #[getter] fn sha1<'py>(&self, py: Python<'py>) -> PyResult>> { Ok(self.0.sha1().map(|o| PyBytes::new(py, &o))) } #[getter] fn key<'a>(&self, py: Python<'a>) -> PyResult> { self.0.key().into_pyobject(py) } } Python::attach(|py| { let py_versioned_file = self.0.bind(py); let stream = stream.collect::>(); let py_stream = stream.into_iter().map(PyContentFactory).collect::>(); py_versioned_file.call_method1("insert_record_stream", (py_stream,))?; Ok(()) }) } } bzrformats_3.4.0.orig/crates/bazaar/src/repository.rs0000644000000000000000000000435615162074037017770 0ustar00/// A repository format. /// /// Formats provide four things: /// * An initialization routine to construct repository data on disk. /// * a optional format string which is used when the BzrDir supports /// versioned children. /// * an open routine which returns a Repository instance. /// * A network name for referring to the format in smart server RPC /// methods. /// /// There is one and only one Format subclass for each on-disk format. But /// there can be one Repository subclass that is used for several different /// formats. The _format attribute on a Repository instance can be used to /// determine the disk format. 
///
/// Formats are placed in a registry by their format string for reference
/// during opening. These should be subclasses of RepositoryFormat for
/// consistency.
///
/// Once a format is deprecated, just deprecate the initialize and open
/// methods on the format class. Do not deprecate the object, as the
/// object may be created even when a repository instance hasn't been
/// created.
///
/// Common instance attributes:
/// _matchingcontroldir - the controldir format that the repository format was
/// originally written to work with. This can be used if manually
/// constructing a bzrdir and repository, or more commonly for test suite
/// parameterization.
pub trait RepositoryFormat {
    fn get_format_description(&self) -> String;

    /// Is this format supported?
    ///
    /// Supported formats must be initializable and openable.
    /// Unsupported formats may not support initialization or committing or
    /// some other features depending on the reason for not being supported.
    fn is_supported(&self) -> bool;

    /// Is this format deprecated?
    ///
    /// Deprecated formats may trigger a user-visible warning recommending
    /// the user to upgrade. They are still fully supported.
    fn is_deprecated(&self) -> bool;

    /// A simple byte string uniquely identifying this format for RPC calls.
    ///
    /// MetaDir repository formats use their disk format string to identify the
    /// repository over the wire. All in one formats such as bzr < 0.8, and
    /// foreign formats like svn/git and hg should use some marker which is
    /// unique and immutable.
    fn network_name(&self) -> Vec<u8>;
}
bzrformats_3.4.0.orig/crates/bazaar/src/revision.rs0000644000000000000000000000635515162074037017408 0ustar00
use crate::RevisionId;
use chrono::{DateTime, NaiveDateTime};
use std::collections::HashMap;

pub fn validate_properties(properties: &HashMap<String, Vec<u8>>) -> bool {
    for (key, _value) in properties.iter() {
        if osutils::contains_whitespace(key.as_str()) {
            return false;
        }
    }
    true
}

#[derive(Clone, PartialEq)]
pub struct Revision {
    pub revision_id: RevisionId,
    pub parent_ids: Vec<RevisionId>,
    pub committer: Option<String>,
    pub message: String,
    pub properties: HashMap<String, Vec<u8>>,
    pub inventory_sha1: Option<Vec<u8>>,
    pub timestamp: f64,
    pub timezone: Option<i32>,
}

impl Revision {
    pub fn new(
        revision_id: RevisionId,
        parent_ids: Vec<RevisionId>,
        committer: Option<String>,
        message: String,
        properties: HashMap<String, Vec<u8>>,
        inventory_sha1: Option<Vec<u8>>,
        timestamp: f64,
        timezone: Option<i32>,
    ) -> Self {
        Revision {
            revision_id,
            parent_ids,
            committer,
            message,
            properties,
            inventory_sha1,
            timestamp,
            timezone,
        }
    }

    pub fn datetime(&self) -> NaiveDateTime {
        DateTime::from_timestamp(self.timestamp as i64, 0)
            .expect("timestamp should be valid")
            .naive_utc()
    }

    pub fn timezone(&self) -> Option<chrono::FixedOffset> {
        self.timezone
            .map(|t| chrono::FixedOffset::east_opt(t).unwrap())
    }

    pub fn check_properties(&self) -> bool {
        validate_properties(&self.properties)
    }

    pub fn get_summary(&self) -> String {
        if self.message.is_empty() {
            String::new()
        } else {
            let mut summary = self.message.trim().lines().next().unwrap().to_string();
            summary = summary.trim().to_string();
            summary
        }
    }

    fn get_property_as_str(&self, key: &str) -> Option<String> {
        self.properties
            .get(key)
            .map(|x| String::from_utf8_lossy(x).to_string())
    }

    /// Return the apparent authors of this revision.
    ///
    /// If the revision properties contain the names of the authors,
    /// return them. Otherwise return the committer name.
    ///
    /// The return value will be a list containing at least one element.
    pub fn get_apparent_authors(&self) -> Vec<String> {
        let authors = match self.get_property_as_str("authors") {
            Some(authors) => {
                let authors = authors.split('\n').collect::<Vec<_>>();
                authors.iter().map(|x| x.to_string()).collect()
            }
            None => self.get_property_as_str("author").map_or(
                self.committer.clone().map_or(vec![], |v| vec![v]),
                |author| vec![author],
            ),
        };
        authors.into_iter().filter(|x| !x.is_empty()).collect()
    }

    pub fn bug_urls(&self) -> Vec<String> {
        self.get_property_as_str("bugs").map_or(vec![], |bugs| {
            bugs.split('\n').map(|x| x.to_string()).collect()
        })
    }
}

impl std::fmt::Display for Revision {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "Revision({})", self.revision_id)
    }
}
bzrformats_3.4.0.orig/crates/bazaar/src/rio.rs0000644000000000000000000003227115162074037016337 0ustar00
/// The RIO file format
///
/// Copyright (C) 2023 Jelmer Vernooij
///
/// Based on the Python implementation:
/// Copyright (C) 2005 Canonical Ltd.
///
/// \subsection{\emph{rio} - simple text metaformat}
///
/// \emph{r} stands for `restricted', `reproducible', or `rfc822-like'.
///
/// The stored data consists of a series of \emph{stanzas}, each of which contains
/// \emph{fields} identified by an ascii name, with Unicode or string contents.
/// The field tag is constrained to alphanumeric characters.
/// There may be more than one field in a stanza with the same name.
///
/// The format itself does not deal with character encoding issues, though
/// the result will normally be written in Unicode.
///
/// The format is intended to be simple enough that there is exactly one character
/// stream representation of an object and vice versa, and that this relation
/// will continue to hold for future versions of bzr.
use regex::Regex;
use std::collections::HashMap;
use std::io::{BufRead, Write};
use std::iter::Iterator;
use std::result::Result;
use std::str;

#[derive(Debug)]
pub enum Error {
    Io(std::io::Error),
    InvalidTag(String),
    ContinuationLineWithoutTag,
    TagValueSeparatorNotFound(Vec<u8>),
    Other(String),
}

impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}

/// Verify whether a tag is validly formatted
pub fn valid_tag(tag: &str) -> bool {
    lazy_static::lazy_static! {
        static ref RE: Regex = Regex::new(r"^[-a-zA-Z0-9_]+$").unwrap();
    }
    RE.is_match(tag)
}

pub struct RioWriter<W: Write> {
    soft_nl: bool,
    to_file: W,
}

impl<W: Write> RioWriter<W> {
    pub fn new(to_file: W) -> Self {
        RioWriter {
            soft_nl: false,
            to_file,
        }
    }

    pub fn write_stanza(&mut self, stanza: &Stanza) -> Result<(), std::io::Error> {
        if self.soft_nl {
            self.to_file.write_all(b"\n")?;
        }
        stanza.write(&mut self.to_file)?;
        self.soft_nl = true;
        Ok(())
    }
}

pub struct RioReader<R: BufRead> {
    from_file: R,
}

impl<R: BufRead> RioReader<R> {
    pub fn new(from_file: R) -> Self {
        RioReader { from_file }
    }

    fn read_stanza(&mut self) -> Result<Option<Stanza>, Error> {
        read_stanza_file(&mut self.from_file)
    }

    pub fn iter(&mut self) -> RioReaderIter<'_, R> {
        RioReaderIter { reader: self }
    }
}

pub struct RioReaderIter<'a, R: BufRead> {
    reader: &'a mut RioReader<R>,
}

impl<R: BufRead> Iterator for RioReaderIter<'_, R> {
    type Item = Result<Option<Stanza>, Error>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.reader.read_stanza() {
            Ok(stanza) => stanza.map(|s| Ok(Some(s))),
            Err(e) => Some(Err(e)),
        }
    }
}

#[derive(Debug, Clone)]
pub struct Stanza {
    items: Vec<(String, StanzaValue)>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum StanzaValue {
    String(String),
    Stanza(Box<Stanza>),
}

impl PartialEq for Stanza {
    fn eq(&self, other: &Self) -> bool {
        if self.len() != other.len() {
            return false;
        }
        for (self_item, other_item) in self.items.iter().zip(other.items.iter()) {
            let (self_tag, self_value) = self_item;
            let (other_tag, other_value) = other_item;
            if self_tag != other_tag {
                return false;
            }
            if self_value != other_value {
                return false;
            }
        }
        true
    }
}

impl Stanza {
    pub fn new() -> Stanza {
        Stanza { items: vec![] }
    }

    pub fn from_pairs(pairs: Vec<(String, StanzaValue)>) -> Stanza {
        Stanza { items: pairs }
    }

    pub fn add(&mut self, tag: String, value: StanzaValue) -> Result<(), Error> {
        if !valid_tag(&tag) {
            return Err(Error::InvalidTag(tag));
        }
        self.items.push((tag, value));
        Ok(())
    }

    pub fn contains(&self, find_tag: &str) -> bool {
        for (tag, _) in &self.items {
            if tag == find_tag {
                return true;
            }
        }
        false
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    pub fn iter_pairs(&self) -> impl Iterator<Item = (&str, &StanzaValue)> {
        self.items.iter().map(|(tag, value)| (tag.as_str(), value))
    }

    pub fn to_bytes_lines(&self) -> Vec<Vec<u8>> {
        self.to_lines()
            .iter()
            .map(|s| s.as_bytes().to_vec())
            .collect()
    }

    pub fn to_lines(&self) -> Vec<String> {
        let mut result = Vec::new();
        for (text_tag, text_value) in &self.items {
            let tag = text_tag.as_bytes();
            let value = match text_value {
                StanzaValue::String(val) => val.to_string(),
                StanzaValue::Stanza(val) => val.to_string(),
            };
            if value.is_empty() {
                result.push(format!("{}: \n", String::from_utf8_lossy(tag)));
            } else if value.contains('\n') {
                let mut val_lines = value.split('\n');
                if let Some(first_line) = val_lines.next() {
                    result.push(format!(
                        "{}: {}\n",
                        String::from_utf8_lossy(tag),
                        first_line
                    ));
                }
                for line in val_lines {
                    result.push(format!("\t{}\n", line));
                }
            } else {
                result.push(format!("{}: {}\n", String::from_utf8_lossy(tag), value));
            }
        }
        result
    }

    pub fn to_string(&self) -> String {
        self.to_lines().join("")
    }

    pub fn to_bytes(&self) -> Vec<u8> {
        self.to_string().into_bytes()
    }

    pub fn write<T: Write>(&self, to_file: &mut T) -> std::io::Result<()> {
        for line in self.to_lines() {
            to_file.write_all(line.as_bytes())?;
        }
        Ok(())
    }

    pub fn get(&self, tag: &str) -> Option<&StanzaValue> {
        for (t, v) in &self.items {
            if t == tag {
                return Some(v);
            }
        }
        None
    }

    pub fn get_all(&self, tag: &str) -> Vec<&StanzaValue> {
        self.items
            .iter()
            .filter(|(t, _)| t == tag)
            .map(|(_, v)| v)
            .collect()
    }

    pub fn as_dict(&self) -> HashMap<String, StanzaValue> {
        let mut d = HashMap::new();
        for (tag, value) in &self.items {
            d.insert(tag.clone(), value.clone());
        }
        d
    }
}

impl std::default::Default for Stanza {
    fn default() -> Self {
        Stanza::new()
    }
}

pub fn read_stanza_file(line_iter: &mut dyn BufRead) -> Result<Option<Stanza>, Error> {
    read_stanza(line_iter.split(b'\n').map(|l| {
        let mut vec: Vec<u8> = l?;
        vec.push(b'\n');
        Ok(vec)
    }))
}

fn trim_newline(vec: &mut Vec<u8>) {
    if let Some(last_non_newline) = vec.iter().rposition(|&b| b != b'\n' && b != b'\r') {
        vec.truncate(last_non_newline + 1);
    } else {
        vec.clear();
    }
}

pub fn read_stanza<I>(lines: I) -> Result<Option<Stanza>, Error>
where
    I: Iterator<Item = Result<Vec<u8>, Error>>,
{
    let mut stanza = Stanza::new();
    let mut tag: Option<String> = None;
    let mut accum_value: Option<Vec<String>> = None;

    for bline in lines {
        let mut line = bline?;
        trim_newline(&mut line);
        if line.is_empty() {
            break; // end of stanza
        } else if line.starts_with(b"\t") {
            // continues previous value
            if tag.is_none() {
                return Err(Error::ContinuationLineWithoutTag);
            }
            if let Some(accum_value) = accum_value.as_mut() {
                let extra = String::from_utf8(line[1..line.len()].to_owned()).unwrap();
                accum_value.push("\n".to_string() + &extra);
            }
        } else {
            // new tag:value line
            if let Some(tag) = tag.take() {
                let value = accum_value.take().map_or_else(String::new, |v| v.join(""));
                stanza.add(tag, StanzaValue::String(value))?;
            }
            let colon_index = match line.windows(2).position(|window| window.eq(b": ")) {
                Some(index) => index,
                None => return Err(Error::TagValueSeparatorNotFound(line)),
            };
            let tagname = String::from_utf8(line[0..colon_index].to_owned()).unwrap();
            if !valid_tag(&tagname) {
                return Err(Error::InvalidTag(tagname));
            }
            tag = Some(tagname);
            let value = String::from_utf8(line[colon_index + 2..line.len()].to_owned()).unwrap();
            accum_value = Some(vec![value]);
        }
    }
    if let Some(tag) = tag {
        let value = accum_value.take().map_or_else(String::new, |v| v.join(""));
        stanza.add(tag, StanzaValue::String(value))?;
        Ok(Some(stanza))
    } else {
        // didn't see any content
        Ok(None)
    }
}

pub fn read_stanzas(line_iter: &mut dyn BufRead) -> Result<Vec<Stanza>, Error> {
    let mut stanzas = vec![];
    while let Some(s) = read_stanza_file(line_iter)? {
        stanzas.push(s);
    }
    Ok(stanzas)
}

pub fn rio_iter(
    stanzas: impl IntoIterator<Item = Stanza>,
    header: Option<Vec<u8>>,
) -> impl Iterator<Item = Vec<u8>> {
    let mut lines = Vec::new();
    if let Some(header) = header {
        let mut header = header;
        header.push(b'\n');
        lines.push(header);
    }
    let mut first_stanza = true;
    for stanza in stanzas {
        if !first_stanza {
            lines.push(b"\n".to_vec());
        }
        lines.push(stanza.to_bytes());
        first_stanza = false;
    }
    lines.into_iter()
}

#[cfg(test)]
mod tests {
    use super::valid_tag;
    use super::{read_stanza, Stanza, StanzaValue};

    #[test]
    fn test_valid_tag() {
        assert!(valid_tag("name"));
        assert!(!valid_tag("!name"));
    }

    #[test]
    fn test_stanza() {
        let mut s = Stanza::new();
        s.add("number".to_string(), StanzaValue::String("42".to_string()))
            .unwrap();
        s.add("name".to_string(), StanzaValue::String("fred".to_string()))
            .unwrap();

        assert!(s.contains("number"));
        assert!(!s.contains("color"));
        assert!(!s.contains("42"));

        // Verify that the s.get() function works
        assert_eq!(
            s.get("number"),
            Some(&StanzaValue::String("42".to_string()))
        );
        assert_eq!(
            s.get("name"),
            Some(&StanzaValue::String("fred".to_string()))
        );
        assert_eq!(s.get("color"), None);

        // Verify that iter_pairs() works
        assert_eq!(s.iter_pairs().count(), 2);
    }

    #[test]
    fn test_eq() {
        let mut s = Stanza::new();
        s.add("number".to_string(), StanzaValue::String("42".to_string()))
            .unwrap();
        s.add("name".to_string(), StanzaValue::String("fred".to_string()))
            .unwrap();

        let mut t = Stanza::new();
        t.add("number".to_string(), StanzaValue::String("42".to_string()))
            .unwrap();
        t.add("name".to_string(), StanzaValue::String("fred".to_string()))
            .unwrap();

        assert_eq!(s, s);
        assert_eq!(s, t);

        t.add("color".to_string(), StanzaValue::String("red".to_string()))
            .unwrap();
        assert_ne!(s, t);
    }

    #[test]
    fn test_empty_value() {
        let s = Stanza::from_pairs(vec![(
            "empty".to_string(),
            StanzaValue::String("".to_string()),
        )]);
        assert_eq!(s.to_string(), "empty: \n");
    }

    #[test]
    fn test_to_lines() {
        let s = Stanza::from_pairs(vec![
            ("number".to_string(), StanzaValue::String("42".to_string())),
            ("name".to_string(), StanzaValue::String("fred".to_string())),
            (
                "field-with-newlines".to_string(),
                StanzaValue::String("foo\nbar\nblah".to_string()),
            ),
            (
                "special-characters".to_string(),
                StanzaValue::String(" \t\r\\\n ".to_string()),
            ),
        ]);
        // The value starts with a space, so two spaces follow the colon.
        assert_eq!(
            s.to_lines(),
            vec![
                "number: 42\n".to_string(),
                "name: fred\n".to_string(),
                "field-with-newlines: foo\n".to_string(),
                "\tbar\n".to_string(),
                "\tblah\n".to_string(),
                "special-characters:  \t\r\\\n".to_string(),
                "\t \n".to_string()
            ],
        );
    }

    #[test]
    fn test_read_stanza() {
        let lines = b"number: 42
name: fred
field-with-newlines: foo
\tbar
\tblah
"
        .split(|c| *c == b'\n')
        .map(|s| s.to_vec());
        let s = read_stanza(lines.map(Ok)).unwrap().unwrap();
        let expected = Stanza::from_pairs(vec![
            ("number".to_string(), StanzaValue::String("42".to_string())),
            ("name".to_string(), StanzaValue::String("fred".to_string())),
            (
                "field-with-newlines".to_string(),
                StanzaValue::String("foo\nbar\nblah".to_string()),
            ),
        ]);
        assert_eq!(s, expected);
    }
}
bzrformats_3.4.0.orig/crates/bazaar/src/serializer.rs0000644000000000000000000000146415162074037017717 0ustar00
use crate::revision::Revision;
use std::io::Read;

#[derive(Debug)]
pub enum Error {
    DecodeError(String),
    EncodeError(String),
    IOError(std::io::Error),
}

impl From<std::io::Error> for Error {
    fn from(error: std::io::Error) -> Self {
        Error::IOError(error)
    }
}

pub trait RevisionSerializer: Send + Sync {
    fn format_name(&self) -> &'static str;

    fn squashes_xml_invalid_characters(&self) -> bool;

    fn read_revision(&self, file: &mut dyn Read) -> Result<Revision, Error>;

    fn write_revision_to_string(&self, revision: &Revision) -> Result<Vec<u8>, Error>;

    fn write_revision_to_lines(
        &self,
        revision: &Revision,
    ) -> Box<dyn Iterator<Item = Result<Vec<u8>, Error>>>;

    fn read_revision_from_string(&self, string: &[u8]) -> Result<Revision, Error>;
}
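// Illustrative sketch only, not part of the crate: the RIO stanza layout that
// rio.rs handles earlier in this archive ("tag: value" lines, with embedded
// newlines in a value continued on tab-prefixed lines, and a blank line ending
// the stanza) can be shown with a std-only round trip. The names `Pairs`,
// `stanza_to_text` and `text_to_stanza` are hypothetical, and the sketch
// assumes string values only: it skips the tag validation, nested stanzas and
// trailing "\r" trimming that the real reader performs.

```rust
/// Hypothetical simplified stanza: ordered (tag, value) pairs, string values only.
type Pairs = Vec<(String, String)>;

/// Serialize pairs as "tag: value\n"; extra lines of a value are tab-prefixed.
fn stanza_to_text(pairs: &[(String, String)]) -> String {
    let mut out = String::new();
    for (tag, value) in pairs {
        let mut parts = value.split('\n');
        // split() yields at least one item, even for an empty value.
        let first = parts.next().unwrap_or("");
        out.push_str(&format!("{}: {}\n", tag, first));
        for rest in parts {
            out.push_str(&format!("\t{}\n", rest));
        }
    }
    out
}

/// Parse a single stanza; returns None when the text is malformed.
fn text_to_stanza(text: &str) -> Option<Pairs> {
    let mut pairs: Pairs = Vec::new();
    for line in text.split('\n') {
        if line.is_empty() {
            break; // a blank line (or the end of input) terminates the stanza
        }
        if let Some(rest) = line.strip_prefix('\t') {
            // Continuation line: append to the previous value, if there is one.
            let last = pairs.last_mut()?;
            last.1.push('\n');
            last.1.push_str(rest);
        } else {
            // New field: the tag and value are separated by ": ".
            let (tag, value) = line.split_once(": ")?;
            pairs.push((tag.to_string(), value.to_string()));
        }
    }
    Some(pairs)
}
```

// Feeding the output of `stanza_to_text` back into `text_to_stanza` should
// return the original pairs, which is the "exactly one representation"
// property the RIO format description aims for. A continuation line with no
// preceding field yields `None` here, where the real reader reports
// `Error::ContinuationLineWithoutTag`.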
bzrformats_3.4.0.orig/crates/bazaar/src/smart/0000755000000000000000000000000015162074037016321 5ustar00bzrformats_3.4.0.orig/crates/bazaar/src/versionedfile.rs0000644000000000000000000005456215162074037020413 0ustar00use byteorder::{BigEndian, WriteBytesExt}; use pyo3::prelude::PyAnyMethods; use pyo3::types::{PyBytes, PyTuple}; use std::borrow::Cow; use std::collections::HashMap; use std::convert::TryInto; #[derive(Debug)] pub enum Error { ExistingContent(Key), VersionNotPresent(VersionId), Io(std::io::Error), } impl From for Error { fn from(e: std::io::Error) -> Error { Error::Io(e) } } #[cfg(feature = "pyo3")] impl From for pyo3::PyErr { fn from(e: Error) -> pyo3::PyErr { pyo3::import_exception!(bzrformats.errors, RevisionNotPresent); pyo3::import_exception!(bzrformats.errors, ExistingContent); match e { Error::VersionNotPresent(key) => { RevisionNotPresent::new_err(format!("Version not present: {:?}", key)) } Error::ExistingContent(key) => { ExistingContent::new_err(format!("Existing content: {:?}", key)) } Error::Io(e) => e.into(), } } } #[cfg(feature = "pyo3")] impl From for Error { fn from(e: pyo3::PyErr) -> Error { pyo3::import_exception!(bzrformats.errors, RevisionNotPresent); pyo3::import_exception!(bzrformats.errors, ExistingContent); pyo3::Python::attach(|py| { if e.is_instance_of::(py) { Error::VersionNotPresent( e.value(py) .getattr("args") .unwrap() .get_item(0) .unwrap() .extract() .unwrap(), ) } else if e.is_instance_of::(py) { Error::ExistingContent( e.value(py) .getattr("args") .unwrap() .get_item(0) .unwrap() .extract() .unwrap(), ) } else { panic!("Unexpected error: {:?}", e) } }) } } impl std::fmt::Display for Error { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { match self { Error::ExistingContent(key) => write!(f, "Existing content: {:?}", key), Error::VersionNotPresent(version) => write!(f, "Version not present: {:?}", version), Error::Io(e) => write!(f, "IO error: {}", e), } } } impl std::error::Error for Error {} pub 
enum Ordering { Unordered, Topological, } impl ToString for Ordering { fn to_string(&self) -> String { match self { Ordering::Unordered => "unordered".to_string(), Ordering::Topological => "topological".to_string(), } } } #[cfg(feature = "pyo3")] impl<'py> pyo3::IntoPyObject<'py> for Ordering { type Target = pyo3::types::PyString; type Output = pyo3::Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: pyo3::Python<'py>) -> Result { Ok(self.to_string().into_pyobject(py)?) } } #[cfg(feature = "pyo3")] impl<'a, 'py> pyo3::FromPyObject<'a, 'py> for Ordering { type Error = pyo3::PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, pyo3::PyAny>) -> pyo3::PyResult { let s = ob.extract::()?; match s.as_str() { "unordered" => Ok(Ordering::Unordered), "topological" => Ok(Ordering::Topological), _ => Err(pyo3::exceptions::PyValueError::new_err(format!( "Expected 'unordered' or 'topological', got '{}'", s ))), } } } #[derive(Clone, Debug, PartialEq, Eq, Hash)] pub struct VersionId(Vec); #[cfg(feature = "pyo3")] impl<'py> pyo3::IntoPyObject<'py> for &VersionId { type Target = pyo3::types::PyBytes; type Output = pyo3::Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: pyo3::Python<'py>) -> Result { let bytes = PyBytes::new(py, &self.0); Ok(bytes.into_pyobject(py)?) 
} } #[cfg(feature = "pyo3")] impl<'a, 'py> pyo3::FromPyObject<'a, 'py> for VersionId { type Error = pyo3::PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, pyo3::PyAny>) -> pyo3::PyResult { let bytes = ob.extract::>()?; Ok(VersionId(bytes)) } } impl std::fmt::Display for VersionId { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { write!(f, "VersionId({:?})", self.0)?; Ok(()) } } #[derive(Clone, Debug, PartialEq, Eq, Hash)] pub enum Key { Fixed(Vec>), ContentAddressed(Vec>), } impl Key { pub fn add_prefix(&mut self, prefix: &[&[u8]]) { let v = match self { Key::Fixed(ref mut v) => v, Key::ContentAddressed(ref mut v) => v, }; for p in prefix.iter().rev() { v.insert(0, p.to_vec()); } } } impl std::fmt::Display for Key { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { match self { Key::Fixed(v) => { write!(f, "(")?; for (i, v) in v.iter().enumerate() { if i > 0 { write!(f, ", ")?; } write!(f, "{:?}", v)?; } write!(f, ")") } Key::ContentAddressed(v) => { write!(f, "(")?; for v in v.iter() { write!(f, "{:?}", v)?; write!(f, ", ")?; } write!(f, "")?; write!(f, ")") } } } } #[cfg(feature = "pyo3")] impl<'a, 'py> pyo3::FromPyObject<'a, 'py> for Key { type Error = pyo3::PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, pyo3::PyAny>) -> pyo3::PyResult { use pyo3::prelude::*; // Look at the type name, stripping out the module name. match ob .get_type() .name() .unwrap() .to_string() .split('.') .next_back() .unwrap() { "tuple" | "StaticTuple" => {} _ => { return Err(pyo3::exceptions::PyTypeError::new_err(format!( "Expected tuple or StaticTuple, got {}", ob.get_type().name().unwrap() ))); } } let mut v = Vec::with_capacity(ob.len()?); for i in 0..ob.len()? - 1 { let b = ob.get_item(i)?.extract::>()?; v.push(b.as_bytes().to_vec()); } if let Some(b) = ob .get_item(ob.len()? - 1)? .extract::>>()? 
{ v.push(b.as_bytes().to_vec()); Ok(Key::Fixed(v)) } else { Ok(Key::ContentAddressed(v)) } } } #[cfg(feature = "pyo3")] impl<'py> pyo3::IntoPyObject<'py> for Key { type Target = pyo3::types::PyTuple; type Output = pyo3::Bound<'py, Self::Target>; type Error = pyo3::PyErr; fn into_pyobject(self, py: pyo3::Python<'py>) -> Result { match self { Key::Fixed(v) => { let t = PyTuple::new( py, v.into_iter() .map(|v| pyo3::types::PyBytes::new(py, v.as_slice())), ); t } Key::ContentAddressed(v) => { let mut entries = v .into_iter() .map(|v| pyo3::types::PyBytes::new(py, v.as_slice()).into_any()) .collect::>(); entries.push(py.None().into_bound(py).into_any()); PyTuple::new(py, entries) } } } } impl bendy::encoding::ToBencode for Key { const MAX_DEPTH: usize = 10; fn encode( &self, encoder: bendy::encoding::SingleItemEncoder<'_>, ) -> Result<(), bendy::encoding::Error> { match self { Key::Fixed(v) => encoder.emit_list(|e| { for v in v.iter() { e.emit_bytes(v)?; } Ok(()) }), Key::ContentAddressed(_v) => { panic!("ContentAddressed keys are not supported in bencode") } } } } #[test] fn test_key_bencode() { let x = Key::Fixed(vec![b"foo".to_vec(), b"bar".to_vec()]); let z = bendy::encoding::ToBencode::to_bencode(&x).unwrap(); assert_eq!(z, b"l3:foo3:bare".to_vec()); } pub trait ContentFactory { /// None, or the sha1 of the content fulltext fn sha1(&self) -> Option>; /// None, or the size of the content fulltext. fn size(&self) -> Option; /// The key of this content. Each key is a tuple with a single string in it. fn key(&self) -> Key; /// A tuple of parent keys for self.key. If the object has no parent information, None (as /// opposed to () for an empty list of parents). 
fn parents(&self) -> Option>; fn to_fulltext<'a, 'b>(&'a self) -> Cow<'b, [u8]> where 'a: 'b; fn to_chunks<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b; fn to_lines<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b; fn into_fulltext(self) -> Vec; fn into_chunks(self) -> Box>>; fn into_lines(self) -> Box>> where Self: Sized, { Box::new( osutils::chunks_to_lines(self.into_chunks().map(Ok::<_, std::io::Error>)) .map(|v| v.unwrap().into_owned()), ) } fn storage_kind(&self) -> String; fn map_key(&mut self, f: &dyn Fn(Key) -> Key); } pub struct FulltextContentFactory { sha1: Option>, size: usize, key: Key, parents: Option>, fulltext: Vec, } impl ContentFactory for FulltextContentFactory { fn sha1(&self) -> Option> { self.sha1.clone() } fn size(&self) -> Option { Some(self.size) } fn key(&self) -> Key { self.key.clone() } fn parents(&self) -> Option> { self.parents.clone() } fn to_fulltext<'a, 'b>(&'a self) -> Cow<'b, [u8]> where 'a: 'b, { Cow::Borrowed(&self.fulltext) } fn to_chunks<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { Box::new( self.fulltext .as_slice() .chunks(crate::DEFAULT_CHUNK_SIZE) .map(|v| v.into()), ) } fn to_lines<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { Box::new( osutils::chunks_to_lines(std::iter::once(Ok::<_, std::io::Error>( &self.fulltext, ))) .map(|v| v.unwrap()), ) } fn into_fulltext(self) -> Vec { self.fulltext } fn into_chunks(self) -> Box>> { let mut fulltext = self.fulltext; Box::new(std::iter::from_fn(move || { if fulltext.is_empty() { None } else { let chunk = fulltext .drain(..std::cmp::min(crate::DEFAULT_CHUNK_SIZE, fulltext.len())) .collect::>(); Some(chunk) } })) } fn storage_kind(&self) -> String { "fulltext".into() } fn map_key(&mut self, f: &dyn Fn(Key) -> Key) { self.key = f(self.key.clone()); self.parents = self.parents.take().map(|v| v.into_iter().map(f).collect()); } } impl FulltextContentFactory { pub fn new( sha1: Option>, key: Key, parents: Option>, fulltext: Vec, ) -> Self { Self { sha1, size: fulltext.len(), key, 
parents, fulltext, } } } pub struct ChunkedContentFactory { sha1: Option>, size: usize, key: Key, parents: Option>, chunks: Vec>, } impl ChunkedContentFactory { pub fn new( sha1: Option>, key: Key, parents: Option>, chunks: Vec>, ) -> Self { Self { sha1, size: chunks.iter().map(|v| v.len()).sum(), key, parents, chunks, } } } impl ContentFactory for ChunkedContentFactory { fn sha1(&self) -> Option> { self.sha1.clone() } fn size(&self) -> Option { Some(self.size) } fn key(&self) -> Key { self.key.clone() } fn parents(&self) -> Option> { self.parents.clone() } fn to_fulltext<'a, 'b>(&'a self) -> Cow<'b, [u8]> where 'a: 'b, { self.chunks.concat().into() } fn to_chunks<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { Box::new(self.chunks.iter().map(|v| v.into())) } fn to_lines<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { Box::new( osutils::chunks_to_lines(self.chunks.iter().map(Ok::<_, std::io::Error>)) .map(|l| l.unwrap()), ) } fn into_fulltext(self) -> Vec { self.chunks.into_iter().flatten().collect() } fn into_chunks(self) -> Box>> { Box::new(self.chunks.into_iter()) } fn storage_kind(&self) -> String { "chunked".into() } fn map_key(&mut self, f: &dyn Fn(Key) -> Key) { self.key = f(self.key.clone()); self.parents = self.parents.take().map(|v| v.into_iter().map(f).collect()); } } pub struct AbsentContentFactory { key: Key, } impl AbsentContentFactory { pub fn new(key: Key) -> Self { Self { key } } } impl ContentFactory for AbsentContentFactory { fn sha1(&self) -> Option> { None } fn size(&self) -> Option { None } fn key(&self) -> Key { self.key.clone() } fn parents(&self) -> Option> { None } fn to_fulltext<'a, 'b>(&'a self) -> Cow<'b, [u8]> where 'a: 'b, { panic!("A request was made for key: {}, but that content is not available, and the calling code does not handle if it is missing.", self.key); } fn to_chunks<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { panic!("A request was made for key: {}, but that content is not available, and the calling code does not 
handle if it is missing.", self.key); } fn to_lines<'a, 'b>(&'a self) -> Box> + 'b> where 'a: 'b, { panic!("A request was made for key: {}, but that content is not available, and the calling code does not handle if it is missing.", self.key); } fn into_fulltext(self) -> Vec { panic!("A request was made for key: {}, but that content is not available, and the calling code does not handle if it is missing.", self.key); } fn into_chunks(self) -> Box>> { panic!("A request was made for key: {}, but that content is not available, and the calling code does not handle if it is missing.", self.key); } fn storage_kind(&self) -> String { "absent".into() } fn map_key(&mut self, f: &dyn Fn(Key) -> Key) { self.key = f(self.key.clone()); } } pub trait VersionedFile { fn check_not_reserved_id(id: &VersionId) -> bool; fn get_record_stream( &self, keys: &[&VersionId], ordering: Ordering, include_delta_closure: bool, ) -> Box>; fn add_lines<'a>( &mut self, version_id: &VersionId, parent_texts: Option>, lines: impl Iterator, nostore_sha: Option, random_id: bool, ) -> Result<(Vec, usize, I), Error>; fn has_version(&self, version_id: &VersionId) -> bool; fn insert_record_stream( &mut self, stream: impl Iterator>, ) -> Result<(), Error>; fn get_format_signature(&self) -> String; fn get_lines( &self, version_id: &VersionId, ) -> Result>>, Error> { let record_stream = self.get_record_stream(&[version_id], Ordering::Unordered, false); if let Some(record) = record_stream.into_iter().next() { Ok(record.into_lines()) } else { Err(Error::VersionNotPresent(version_id.clone())) } } fn get_text(&self, version_id: &VersionId) -> Result, Error> { let record_stream = self.get_record_stream(&[version_id], Ordering::Unordered, false); if let Some(record) = record_stream.into_iter().next() { Ok(record.into_fulltext()) } else { Err(Error::VersionNotPresent(version_id.clone())) } } fn get_chunks( &self, version_id: &VersionId, ) -> Result>>, Error> { let record_stream = 
self.get_record_stream(&[version_id], Ordering::Unordered, false); if let Some(record) = record_stream.into_iter().next() { Ok(record.into_chunks()) } else { Err(Error::VersionNotPresent(version_id.clone())) } } } /// Storage for many versioned files. /// /// This object allows a single keyspace for accessing the history graph and /// contents of named bytestrings. /// /// Currently no implementation allows the graph of different key prefixes to /// intersect, but the API does allow such implementations in the future. /// /// The keyspace is expressed via simple tuples. Any instance of VersionedFiles /// may have a different length key-size, but that size will be constant for /// all texts added to or retrieved from it. For instance, breezy uses /// instances with a key-size of 2 for storing user files in a repository, with /// the first element the fileid, and the second the version of that file. /// /// The use of tuples allows a single code base to support several different /// uses with only the mapping logic changing from instance to instance. /// /// :ivar _immediate_fallback_vfs: For subclasses that support stacking, /// this is a list of other VersionedFiles immediately underneath this /// one. They may in turn each have further fallbacks. 
pub trait VersionedFiles { fn check_not_reserved_id(id: &VersionId) -> bool; fn get_record_stream( &self, keys: &[&Key], ordering: Ordering, include_delta_closure: bool, ) -> Box>; } pub fn record_to_fulltext_bytes( record: R, w: &mut W, ) -> std::io::Result<()> { let mut record_meta = bendy::encoding::Encoder::new(); record_meta .emit_list(|e| { e.emit(record.key())?; if let Some(parents) = record.parents() { e.emit_list(|e| { for parent in parents { e.emit(parent)?; } Ok(()) })?; } else { e.emit_bytes(&b"nil"[..])?; // default to a single byte vector containing "nil" } Ok(()) }) .unwrap(); let record_meta = record_meta.get_output().unwrap(); w.write_all(b"fulltext\n")?; w.write_all(&length_prefix(&record_meta))?; w.write_all(&record_meta)?; w.write_all(&record.into_fulltext())?; Ok(()) } fn length_prefix(data: &[u8]) -> Vec { let length = data.len() as u32; let mut length_bytes = vec![]; // Write the length as a 4-byte big-endian representation length_bytes .write_u32::(length) .expect("Failed to write length bytes"); length_bytes } pub fn fulltext_network_to_record(bytes: &[u8], line_end: usize) -> FulltextContentFactory { // Extract meta_len from the network fulltext record let meta_len_bytes: [u8; 4] = bytes[line_end..line_end + 4] .try_into() .expect("Expected 4 bytes for meta_len"); let meta_len = u32::from_be_bytes(meta_len_bytes) as usize; // Extract record_meta using meta_len let record_meta = &bytes[line_end + 4..line_end + 4 + meta_len]; // Decode record_meta using Bencode let mut decoder = bendy::decoding::Decoder::new(record_meta); let mut tuple = decoder .next_object() .expect("Failed to decode record_meta using Bencode") .expect("Failed to decode tuple using Bencode") .try_into_list() .unwrap(); fn decode_key(o: bendy::decoding::Object) -> Key { let mut ret = vec![]; let mut l = o.try_into_list().unwrap(); while let Some(b) = l.next_object().unwrap() { ret.push(b.try_into_bytes().unwrap().to_vec()); } Key::Fixed(ret) } let key = decode_key( tuple 
            .next_object()
            .expect("Failed to decode record_meta using Bencode")
            .expect("Failed to decode key using Bencode"),
    );
    let parents = tuple
        .next_object()
        .expect("Failed to decode record_meta using Bencode")
        .expect("Failed to decode parents using Bencode");

    // Convert parents from "nil" to None
    let parents = match parents {
        bendy::decoding::Object::Bytes(bytes) => {
            if bytes == b"nil" {
                None
            } else {
                panic!("Expected parents to be a list or nil");
            }
        }
        bendy::decoding::Object::List(mut l) => {
            let mut parents = vec![];
            while let Some(parent) = l.next_object().unwrap() {
                parents.push(decode_key(parent));
            }
            Some(parents)
        }
        _ => panic!("Expected parents to be a list or nil"),
    };

    // Extract fulltext from the remaining bytes
    let fulltext = &bytes[line_end + 4 + meta_len..];
    FulltextContentFactory::new(None, key, parents, fulltext.to_vec())
}
bzrformats_3.4.0.orig/crates/bazaar/src/xml_serializer.rs0000644000000000000000000003416515162074037020603 0ustar00
#![allow(dead_code)]
use crate::revision::Revision;
use crate::serializer::{Error, RevisionSerializer};
use crate::RevisionId;
use lazy_regex::regex_replace_all;
use std::collections::HashMap;
use std::io::{BufRead, Read, Write};
use std::str;
use xmltree::Element;

/// Map an XML-special ASCII byte to its predefined entity reference.
fn escape_low(c: u8) -> Option<&'static str> {
    match c {
        b'&' => Some("&amp;"),
        b'\'' => Some("&apos;"),
        b'"' => Some("&quot;"),
        b'<' => Some("&lt;"),
        b'>' => Some("&gt;"),
        _ => None,
    }
}

fn unicode_escape_replace(cap: &regex::Captures) -> String {
    let m = cap.get(0).unwrap();
    assert_eq!(m.as_str().chars().count(), 1,);
    let c = m.as_str().chars().next().unwrap();
    if m.as_str().len() == 1 {
        if let Some(ret) = escape_low(m.as_str().as_bytes()[0]) {
            return ret.to_string();
        }
    }
    // Anything else becomes a numeric character reference.
    format!("&#{};", c as u32)
}

fn utf8_escape_replace(cap: &regex::bytes::Captures) -> Vec<u8> {
    let m = cap.get(0).unwrap().as_bytes();
    if m.len() == 1 {
        if let Some(ret) = escape_low(m[0]) {
            return ret.as_bytes().to_vec();
        }
    }
    let utf8 = str::from_utf8(m).unwrap();
    utf8.chars()
        .map(|c| format!("&#{};", c as
u64).into_bytes())
        .collect::<Vec<Vec<u8>>>()
        .concat()
}

pub fn encode_and_escape_string(text: &str) -> String {
    regex_replace_all!(r#"[&<>'"\u{007f}-\u{ffff}]"#, text, unicode_escape_replace).into_owned()
}

pub fn encode_and_escape_bytes(data: &[u8]) -> String {
    let bytes =
        regex_replace_all!(r#"(?-u)[&<>'"]|[\x7f-\xff]+"#B, data, utf8_escape_replace).into_owned();
    String::from_utf8_lossy(bytes.as_slice()).to_string()
}

/// Escape a character that XML 1.0 cannot represent as a `\xNN` sequence.
///
/// Tab, newline, carriage return and DEL are passed through unchanged.
fn escape_invalid_char(c: char) -> String {
    if c == '\t' || c == '\n' || c == '\r' || c == '\x7f' {
        c.to_string()
    } else if c.is_ascii_control()
        || (c as u32) > 0xD7FF && (c as u32) < 0xE000
        || (c as u32) > 0xFFFD && (c as u32) < 0x10000
    {
        format!("\\x{:02x}", c as u32)
    } else {
        c.to_string()
    }
}

pub fn escape_invalid_chars(message: &str) -> String {
    message
        .chars()
        .map(escape_invalid_char)
        .collect::<Vec<String>>()
        .join("")
}

fn unpack_revision_properties(
    elt: &xmltree::Element,
) -> Result<HashMap<String, Vec<u8>>, Error> {
    if let Some(props_elt) = elt.get_child("properties") {
        let mut properties = HashMap::new();
        for child in props_elt.children.iter() {
            let child = child.as_element().ok_or_else(|| {
                Error::DecodeError(format!("bad tag under properties list: {:?}", child))
            })?;
            if child.name != "property" {
                return Err(Error::DecodeError(format!(
                    "bad tag under properties list: {:?}",
                    child
                )));
            }
            let name = child.attributes.get("name").ok_or_else(|| {
                Error::DecodeError("property element missing name attribute".to_owned())
            })?;
            // A property element without text content maps to an empty value.
            let value = child
                .get_text()
                .map_or_else(Vec::new, |s| s.as_bytes().to_vec());
            properties.insert(name.clone(), value);
        }
        Ok(properties)
    } else {
        Ok(HashMap::new())
    }
}

// TODO(jelmer): Move this to somewhere more central?
fn surrogate_escape(b: u8) -> Vec { let hi = 0xDC80 + ((b >> 4) as u32); let lo = 0xDC00 + ((b & 0x0F) as u32); let mut result = Vec::new(); result.extend_from_slice(&hi.to_be_bytes()); result.extend_from_slice(&lo.to_be_bytes()); result } fn utf8_encode_surrogate(codepoint: u32) -> Vec { let mut result = Vec::new(); if codepoint < 0x80 { result.push(codepoint as u8); } else if codepoint < 0x800 { result.push(((codepoint >> 6) & 0x1F) as u8 | 0xC0); result.push((codepoint & 0x3F) as u8 | 0x80); } else if codepoint < 0x10000 { result.push(((codepoint >> 12) & 0x0F) as u8 | 0xE0); result.push(((codepoint >> 6) & 0x3F) as u8 | 0x80); result.push((codepoint & 0x3F) as u8 | 0x80); } else if codepoint < 0x110000 { result.push(((codepoint >> 18) & 0x07) as u8 | 0xF0); result.push(((codepoint >> 12) & 0x3F) as u8 | 0x80); result.push(((codepoint >> 6) & 0x3F) as u8 | 0x80); result.push((codepoint & 0x3F) as u8 | 0x80); } else { panic!("Invalid codepoint: {}", codepoint); } result } fn decode_pep838(bytes: &[u8], surrogate_fn: F, other_fn: G) -> String where F: Fn(u32) -> String, G: Fn(char) -> String, { let mut result = Vec::new(); let mut i = 0; while i < bytes.len() { let byte = bytes[i]; if byte & 0x80 == 0 { // single-byte character result.push(other_fn(byte as char)); i += 1; } else if byte & 0xE0 == 0xC0 { // two-byte character if i + 1 < bytes.len() { let c = (((byte & 0x1F) as u32) << 6) | ((bytes[i + 1] & 0x3F) as u32); result.push(other_fn(char::from_u32(c).unwrap())); } else { result.push(other_fn('\u{FFFD}')); } i += 2; } else if byte & 0xF0 == 0xE0 { // three-byte character if i + 2 < bytes.len() { let c = (((byte & 0x0F) as u32) << 12) | (((bytes[i + 1] & 0x3F) as u32) << 6) | ((bytes[i + 2] & 0x3F) as u32); result.push(other_fn(char::from_u32(c).unwrap())); } else { result.push(other_fn('\u{FFFD}')); } i += 3; } else if byte & 0xF8 == 0xF0 { // four-byte character if i + 3 < bytes.len() { let high = ((byte & 0x07) as u16) << 2 | ((bytes[i + 1] & 0x30) >> 4) 
as u16; let low = ((bytes[i + 1] & 0x0F) as u16) << 6 | (bytes[i + 2] & 0x3F) as u16; result.push(surrogate_fn(((high as u32) << 16) | (low as u32))); i += 4; } else { result.push(other_fn('\u{FFFD}')); i += 1; } } else { // invalid character result.push(other_fn('\u{FFFD}')); i += 1; } } result.concat() } impl RevisionSerializer for T { fn format_name(&self) -> &'static str { self.format_num() } fn squashes_xml_invalid_characters(&self) -> bool { true } fn read_revision(&self, file: &mut dyn Read) -> Result { let element = Element::parse(file) .map_err(|e| Error::DecodeError(format!("XML parse error: {}", e)))?; self.unpack_revision(element) } fn read_revision_from_string(&self, text: &[u8]) -> Result { let mut cursor = std::io::Cursor::new(text); self.read_revision(&mut cursor) } fn write_revision_to_lines( &self, rev: &Revision, ) -> Box, Error>>> { let buf = self.write_revision_to_string(rev); if let Ok(buf) = buf { let cursor = std::io::Cursor::new(buf); let mut reader = std::io::BufReader::new(cursor); Box::new(std::iter::from_fn(move || { let mut line = Vec::new(); match reader.read_until(b'\n', &mut line) { Ok(0) => None, Ok(_) => Some(Ok(line)), Err(e) => Some(Err(Error::IOError(e))), } })) } else { Box::new(std::iter::once(Err(Error::EncodeError( "Failed to write revision to string".to_string(), )))) } } fn write_revision_to_string(&self, rev: &Revision) -> Result, Error> { let mut buf = Vec::new(); buf.write_all(b"\n")?; let message = encode_and_escape_string(escape_invalid_chars(rev.message.as_str()).as_str()); buf.write_all(format!("{}\n", message).as_bytes())?; if !rev.parent_ids.is_empty() { buf.write_all(b"\n")?; for parent_id in &rev.parent_ids { if parent_id.is_reserved() { panic!("reserved revision id used as parent: {}", parent_id); } buf.write_all( format!( "\n", encode_and_escape_bytes(parent_id.as_bytes()) ) .as_bytes(), )?; } buf.write_all(b"\n")?; } if !rev.properties.is_empty() { buf.write_all(b"")?; let mut sorted_keys: Vec<_> = 
rev.properties.keys().collect(); sorted_keys.sort(); for prop_name in sorted_keys { let prop_value = rev.properties.get(prop_name).unwrap(); if !prop_value.is_empty() { buf.write_all( format!( "", encode_and_escape_string(prop_name) ) .as_bytes(), )?; let prop_value = decode_pep838( prop_value, |c| { utf8_encode_surrogate(c) .iter() .map(|x| format!("\\x{:02x}", *x as u32)) .collect() }, escape_invalid_char, ); buf.write_all(encode_and_escape_string(prop_value.as_str()).as_bytes())?; buf.write_all(b"\n")?; } else { buf.write_all( format!( "\n", encode_and_escape_string(prop_name) ) .as_bytes(), )?; } } buf.write_all(b"\n")?; } buf.write_all(b"\n")?; Ok(buf) } } pub trait XMLRevisionSerializer: RevisionSerializer { fn format_num(&self) -> &'static str; fn unpack_revision(&self, document: xmltree::Element) -> Result { if document.name != "revision" { return Err(Error::DecodeError(format!( "expected revision element, got {}", document.name ))); } if let Some(format) = document.attributes.get("format") { if format != self.format_num() { return Err(Error::DecodeError(format!( "invalid format version {} on revision", format ))); } } let parents_ids = document .get_child("parents") .map_or_else(std::vec::Vec::new, |e| { e.children .iter() .filter_map(|n| n.as_element()) .map(|c| RevisionId::from(c.attributes.get("revision_id").unwrap().as_bytes())) .collect() }); let timezone = document .attributes .get("timezone") .map_or_else(|| None, |v| Some(v.parse::().unwrap())); let message = document.get_child("message").map_or_else( || "".to_string(), |e| { e.get_text() .map_or_else(|| "".to_owned(), |t| t.to_string()) }, ); let revision_id = RevisionId::from( document .attributes .get("revision_id") .ok_or_else(|| { Error::EncodeError("revision element missing revision_id attribute".to_owned()) })? 
.as_bytes(), ); let committer = document.attributes.get("committer").map(|s| s.to_owned()); let properties = unpack_revision_properties(&document)?; let inventory_sha1 = document .attributes .get("inventory_sha1") .map(|s| s.as_bytes().to_vec()); let timestamp = document .attributes .get("timestamp") .ok_or_else(|| { Error::EncodeError("revision element missing timestamp attribute".to_owned()) })? .parse::() .unwrap(); Ok(Revision::new( revision_id, parents_ids, committer, message, properties, inventory_sha1, timestamp, timezone, )) } } pub struct XMLRevisionSerializer8; impl XMLRevisionSerializer for XMLRevisionSerializer8 { fn format_num(&self) -> &'static str { "8" } } pub struct XMLRevisionSerializer5; impl XMLRevisionSerializer for XMLRevisionSerializer5 { fn format_num(&self) -> &'static str { "5" } } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/block.rs0000644000000000000000000004275115162074037021554 0ustar00use crate::groupcompress::delta::{apply_delta, read_base128_int, read_instruction, Instruction}; use byteorder::ReadBytesExt; use std::borrow::Cow; use std::io::BufRead; use std::io::{Read, Write}; /// Group Compress Block v1 Zlib const GCB_HEADER: &[u8] = b"gcb1z\n"; /// Group Compress Block v1 Lzma const GCB_LZ_HEADER: &[u8] = b"gcb1l\n"; #[derive(PartialEq, Eq, Default, Clone, Copy)] pub enum CompressorKind { #[default] Zlib, Lzma, } #[cfg(feature = "pyo3")] impl<'a, 'py> pyo3::FromPyObject<'a, 'py> for CompressorKind { type Error = pyo3::PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, pyo3::PyAny>) -> pyo3::PyResult { let s: Cow = ob.extract()?; match s.as_ref() { "zlib" => Ok(CompressorKind::Zlib), "lzma" => Ok(CompressorKind::Lzma), _ => Err(pyo3::exceptions::PyValueError::new_err(format!( "Unknown compressor: {}", s ))), } } } impl CompressorKind { fn header(&self) -> &'static [u8] { match self { CompressorKind::Zlib => GCB_HEADER, CompressorKind::Lzma => GCB_LZ_HEADER, } } fn from_header(header: &[u8]) -> Option { if header == GCB_HEADER 
        {
            Some(CompressorKind::Zlib)
        } else if header == GCB_LZ_HEADER {
            Some(CompressorKind::Lzma)
        } else {
            None
        }
    }
}

#[derive(Debug)]
pub enum Error {
    InvalidData(String),
    Io(std::io::Error),
}

impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}

impl std::fmt::Display for Error {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        match *self {
            Error::InvalidData(ref s) => write!(f, "Invalid data: {}", s),
            Error::Io(ref e) => write!(f, "IO error: {}", e),
        }
    }
}

impl std::error::Error for Error {}

pub enum GroupCompressItem {
    Fulltext(Vec<u8>),
    Delta(Vec<u8>),
}

pub fn read_item<R: Read>(r: &mut R) -> Result<GroupCompressItem, Error> {
    // The bytes are 'f' or 'd' for the type, then a variable-length
    // base128 integer for the content size, then the actual content.
    // We know that the variable-length integer won't be longer than 5
    // bytes (it takes 5 bytes to encode 2^32).
    let c = r.read_u8()?;
    let content_len = read_base128_int(r).map_err(|e| Error::InvalidData(e.to_string()))?;
    let mut text = vec![0; content_len as usize];
    r.read_exact(&mut text)?;
    match c {
        b'f' => {
            // Fulltext
            Ok(GroupCompressItem::Fulltext(text))
        }
        b'd' => {
            // Delta
            Ok(GroupCompressItem::Delta(text))
        }
        c => Err(Error::InvalidData(format!(
            "Unknown content control code: {:?}",
            c
        ))),
    }
}

/// An object which maintains the internal structure of the compressed data.
///
/// This tracks the meta info (start of text, length, type, etc.)
pub struct GroupCompressBlock { /// The name of the compressor used to compress the content compressor: Option, /// The compressed content z_content_chunks: Option>>, /// The decompressor object z_content_decompressor: Option>, /// The length of the compressed content z_content_length: Option, /// The length of the uncompressed content content_length: Option, /// The uncompressed content content: Option>, /// The uncompressed content, split into chunks content_chunks: Option>>, } impl Default for GroupCompressBlock { fn default() -> Self { Self::new() } } fn read_header(r: &mut R) -> Result { let mut header = [0; 6]; r.read_exact(&mut header).map_err(|e| { Error::InvalidData(format!( "Failed to read header from GroupCompressBlock: {}", e )) })?; CompressorKind::from_header(&header).ok_or_else(|| { Error::InvalidData(format!( "Invalid header in GroupCompressBlock: {:?}", header )) }) } impl GroupCompressBlock { pub fn new() -> Self { // map by key? or just order in file? Self { compressor: None, z_content_chunks: None, z_content_decompressor: None, z_content_length: None, content_length: None, content: None, content_chunks: None, } } pub fn content(&self) -> Option<&[u8]> { self.content.as_deref() } pub fn content_length(&self) -> Option { self.content_length } /// Make sure that content has been expanded enough. /// /// # Arguments /// * `num_bytes` - Ensure that we have extracted at least num_bytes of content. 
If None, consume everything pub fn ensure_content(&mut self, num_bytes: Option) { assert!( self.content_length.is_some(), "self.content_length should never be None" ); let mut num_bytes = match num_bytes { None => self.content_length.unwrap(), Some(num_bytes) => { assert!( num_bytes <= self.content_length.unwrap(), "requested num_bytes ({}) > content length ({})", num_bytes, self.content_length.unwrap() ); num_bytes } }; // Expand the content if required if self.content.is_none() { if let Some(content_chunks) = self.content_chunks.as_ref() { self.content = Some(content_chunks.concat()); self.content_chunks = None; } } if self.content.is_none() { // We join self.z_content_chunks here, because if we are // decompressing, then it is *very* likely that we have a single // chunk if self.z_content_length == Some(0) { self.content = Some(b"".to_vec()); } else { let c = osutils::chunkreader::ChunksReader::new(Box::new( self.z_content_chunks.clone().unwrap().into_iter(), )); self.z_content_decompressor = Some(match self.compressor.unwrap() { CompressorKind::Lzma => { Box::new(xz2::read::XzDecoder::new(c)) as Box } CompressorKind::Zlib => { Box::new(flate2::read::ZlibDecoder::new(c)) as Box } }); self.content = Some(Vec::new()); } } if self.content.as_ref().unwrap().len() >= num_bytes { return; } num_bytes -= self.content.as_ref().unwrap().len(); let mut buf = vec![0; num_bytes]; self.z_content_decompressor .as_mut() .unwrap() .read_exact(&mut buf) .unwrap(); self.content.as_mut().unwrap().extend(buf); } #[allow(clippy::len_without_is_empty)] pub fn len(&self) -> usize { // This is the maximum number of bytes this object will reference if // everything is decompressed. However, if we decompress less than // everything... (this would cause some problems for LRUSizeCache) self.content_length.unwrap() + self.z_content_length.unwrap() } pub fn parse_bytes(&mut self, mut data: &[u8]) -> Result<(), Error> { self.read_bytes(&mut data) } /// Read the various lengths from the header. 
/// /// This also populates the various 'compressed' buffers. fn read_bytes(&mut self, r: &mut R) -> Result<(), Error> { // At present, we have 2 integers for the compressed and uncompressed // content. In base10 (ascii) 14 bytes can represent > 1TB, so to avoid // checking too far, cap the search to 14 bytes. let mut buf = std::io::BufReader::new(r); let mut z_content_length_buf = Vec::new(); buf.read_until(b'\n', &mut z_content_length_buf)?; // Chop off the '\n' z_content_length_buf.pop(); self.z_content_length = Some( String::from_utf8(z_content_length_buf) .unwrap() .parse() .unwrap(), ); let mut content_length_buf = Vec::new(); buf.read_until(b'\n', &mut content_length_buf)?; content_length_buf.pop(); self.content_length = Some( String::from_utf8(content_length_buf) .unwrap() .parse() .unwrap(), ); let mut data = Vec::new(); buf.read_to_end(&mut data)?; // XXX: Define some GCCorrupt error ? assert_eq!( data.len(), self.z_content_length.unwrap(), "Invalid bytes: ({}) != {}", data.len(), self.z_content_length.unwrap() ); self.z_content_chunks = Some(vec![data.to_vec()]); Ok(()) } /// Return z_content_chunks as a simple string. /// /// Meant only to be used by the test suite. pub fn z_content(&mut self) -> Vec { self.z_content_chunks.as_ref().unwrap().concat() } pub fn z_content_chunks(&mut self) -> &mut Vec> { self.z_content_chunks.as_mut().unwrap() } pub fn from_bytes(mut r: R) -> Result { let compressor = read_header(&mut r)?; let mut out = Self { compressor: Some(compressor), z_content_chunks: None, content: None, content_chunks: None, z_content_length: None, content_length: None, z_content_decompressor: None, }; out.read_bytes(&mut r)?; Ok(out) } /// Extract the text for a specific key. /// /// # Arguments /// * `key` - The label used for this content /// * `sha1` - TODO (should we validate only when sha1 is supplied?) 
    ///
    /// # Returns
    /// The bytes for the content
    pub fn extract(&mut self, start: usize, end: usize) -> Result<Vec<Vec<u8>>, Error> {
        if start == 0 && end == 0 {
            return Ok(vec![]);
        }
        self.ensure_content(Some(end));

        let mut content = self.content.as_ref().unwrap().as_slice();
        match read_item(&mut content)? {
            GroupCompressItem::Fulltext(data) => Ok(vec![data]),
            GroupCompressItem::Delta(text) => Ok(vec![apply_delta(
                self.content.as_ref().unwrap(),
                text.as_slice(),
            )
            .unwrap()]),
        }
    }

    /// Set the content of this block to the given chunks.
    pub fn set_chunked_content(&mut self, content_chunks: &[Vec<u8>], length: usize) {
        // If we have lots of short lines, it may be more efficient to join
        // the content ahead of time. If the content is <10MiB, we don't really
        // care about the extra memory consumption, so we can just pack it and
        // be done. However, timing showed 18s => 17.9s for repacking 1k revs of
        // mysql, which is below the noise margin.
        self.content_length = Some(length);
        self.content_chunks = Some(content_chunks.to_vec());
        self.content = None;
        self.z_content_chunks = None;
    }

    /// Set the content of this block.
    pub fn set_content(&mut self, content: &[u8]) {
        self.content_length = Some(content.len());
        self.content = Some(content.to_vec());
        self.z_content_chunks = None;
    }

    fn create_z_content_from_chunks(
        &mut self,
        chunks: Vec<Vec<u8>>,
        compressor_kind: CompressorKind,
    ) {
        // Compress all chunks into a single buffer with the selected codec.
        let chunks = match compressor_kind {
            CompressorKind::Zlib => {
                let mut encoder =
                    flate2::write::ZlibEncoder::new(Vec::new(), flate2::Compression::default());
                for chunk in chunks {
                    encoder.write_all(&chunk).unwrap();
                }
                encoder.finish().unwrap()
            }
            CompressorKind::Lzma => {
                let mut encoder = xz2::write::XzEncoder::new(Vec::new(), 6);
                for chunk in chunks {
                    encoder.write_all(&chunk).unwrap();
                }
                encoder.finish().unwrap()
            }
        };
        self.z_content_length = Some(chunks.len());
        self.z_content_chunks = Some(vec![chunks]);
    }

    fn create_z_content(&mut self, compressor_kind: CompressorKind) {
        // Reuse the existing compressed content if it was made with the same codec.
        if self.z_content_chunks.is_some() && self.compressor == Some(compressor_kind) {
            return;
        }
        let chunks = if let Some(content_chunks) = self.content_chunks.as_ref() {
            content_chunks.to_vec()
        } else {
            vec![self.content.as_ref().unwrap().clone()]
        };
        self.create_z_content_from_chunks(chunks, compressor_kind);
    }

    /// Create the byte stream as a series of 'chunks'.
    pub fn to_chunks(
        &mut self,
        compressor_kind: Option<CompressorKind>,
    ) -> (usize, Vec<Cow<'_, [u8]>>) {
        let compressor_kind = compressor_kind.unwrap_or_default();
        self.create_z_content(compressor_kind);
        let lengths = format!(
            "{}\n{}\n",
            self.z_content_length.unwrap(),
            self.content_length.unwrap()
        );
        let mut chunks = vec![
            Cow::Borrowed(compressor_kind.header()),
            Cow::Owned(lengths.as_bytes().to_vec()),
        ];
        chunks.extend(
            self.z_content_chunks
                .as_ref()
                .unwrap()
                .iter()
                .map(|x| Cow::Borrowed(x.as_slice())),
        );
        let total_len = chunks.iter().map(|x| x.len()).sum();
        (total_len, chunks)
    }

    /// Encode the information into a byte stream.
    pub fn to_bytes(&mut self) -> Vec<u8> {
        let (_total_len, chunks) = self.to_chunks(None);
        chunks.concat()
    }

    /// Take this block, and spit out a human-readable structure.
    ///
    /// # Arguments
    /// * `include_text`: Inserts also include text bits; choose whether you want these
    ///   displayed in the dump or not.
    ///
    /// # Returns
    /// A dump of the given block. The layout is something like: [('f', length), ('d',
    /// delta_length, text_length, [delta_info])]
    /// delta_info := [('i', num_bytes, text), ('c', offset, num_bytes), ...]
    pub fn dump(&mut self, include_text: Option<bool>) -> Result<Vec<DumpInfo>, Error> {
        let include_text = include_text.unwrap_or(false);
        self.ensure_content(None);
        let mut result = vec![];
        let mut content = self.content.as_ref().unwrap().as_slice();
        while !content.is_empty() {
            match read_item(&mut content)? {
                GroupCompressItem::Fulltext(text) => {
                    // Fulltext
                    if include_text {
                        result.push(DumpInfo::Fulltext(Some(text)));
                    } else {
                        result.push(DumpInfo::Fulltext(None));
                    }
                }
                GroupCompressItem::Delta(delta_content) => {
                    let mut delta_info = vec![];
                    // The first entry in a delta is the decompressed length
                    let mut delta_slice = delta_content.as_slice();
                    let decomp_len = read_base128_int(&mut delta_slice).unwrap();
                    let mut measured_len = 0;
                    while !delta_slice.is_empty() {
                        match read_instruction(&mut delta_slice)?
{ Instruction::Insert(text) => { measured_len += text.len(); delta_info.push(DeltaInfo::Insert( text.len(), if include_text { Some(text) } else { None }, )); } Instruction::r#Copy { offset, length } => { delta_info.push(DeltaInfo::Copy( offset, length, if include_text { Some( self.content.as_ref().unwrap()[offset..offset + length] .to_vec(), ) } else { None }, )); measured_len += length; } } } if measured_len != decomp_len as usize { return Err(Error::InvalidData(format!( "Delta claimed fulltext was {} bytes, but extraction resulted in {}", decomp_len, measured_len ))); } result.push(DumpInfo::Delta(decomp_len as usize, delta_info)); } } } Ok(result) } } pub enum DeltaInfo { Insert(usize, Option>), Copy(usize, usize, Option>), } pub enum DumpInfo { Fulltext(Option>), Delta(usize, Vec), } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/compressor.rs0000644000000000000000000002115115162074037022645 0ustar00use crate::groupcompress::block::{read_item, GroupCompressItem}; use crate::groupcompress::delta::{apply_delta, write_base128_int}; use crate::groupcompress::NULL_SHA1; use crate::versionedfile::{Error, Key}; use std::borrow::Cow; use std::collections::HashMap; pub trait GroupCompressor { /// Compress lines with label key. /// /// # Arguments /// * `key`: A key tuple. It is stored in the output /// for identification of the text during decompression. If the last /// element is b'None' it is replaced with the sha1 of the text - /// e.g. sha1:xxxxxxx. /// * `chunks`: Chunks of bytes to be compressed /// * `length`: Length of chunks /// * `expected_sha`: If non-None, the sha the lines are believed to /// have. During compression the sha is calculated; a mismatch will /// cause an error. /// * `nostore_sha`: If the computed sha1 sum matches, we will raise /// ExistingContent rather than adding the text. /// * `soft`: Do a 'soft' compression. This means that we require larger /// ranges to match to be considered for a copy command. 
    ///
    /// # Returns
    /// The sha1 of lines, the start and end offsets in the delta, and the type ('fulltext' or
    /// 'delta').
    fn compress(
        &mut self,
        key: &Key,
        chunks: &[&[u8]],
        length: usize,
        expected_sha: Option<String>,
        nostore_sha: Option<String>,
        soft: Option<bool>,
    ) -> Result<(String, usize, usize, &'static str), Error> {
        if length == 0 {
            // empty, like a dir entry, etc
            if nostore_sha == Some(String::from_utf8_lossy(NULL_SHA1.as_slice()).to_string()) {
                return Err(Error::ExistingContent(key.clone()));
            }
            return Ok((
                String::from_utf8_lossy(NULL_SHA1.as_slice()).to_string(),
                0,
                0,
                "fulltext",
            ));
        }
        // we assume someone knew what they were doing when they passed it in
        let sha = expected_sha.unwrap_or_else(|| osutils::sha::sha_chunks(chunks));
        if let Some(nostore_sha) = nostore_sha {
            if sha == nostore_sha {
                return Err(Error::ExistingContent(key.clone()));
            }
        }
        let key = match key {
            Key::Fixed(key) => key.clone(),
            Key::ContentAddressed(key) => {
                let mut key = key.clone();
                key.push(format!("sha1:{}", sha).as_bytes().to_vec());
                key
            }
        };
        let (start, end, r#type) =
            self.compress_block(&key, chunks, length, (length / 2) as u128, soft)?;
        Ok((sha, start, end, r#type))
    }

    /// Compress chunks with label key.
    ///
    /// # Arguments
    /// * `key`: A key tuple. It is stored in the output for identification
    ///   of the text during decompression.
    /// * `chunks`: The chunks of bytes to be compressed
    /// * `input_len`: The length of the chunks
    /// * `max_delta_size`: The size above which we issue a fulltext instead
    ///   of a delta.
    /// * `soft`: Do a 'soft' compression. This means that we require larger
    ///   ranges to match to be considered for a copy command.
    ///
    /// # Returns
    /// The start and end offsets in the delta, and the type ('fulltext' or
    /// 'delta').
    fn compress_block(
        &mut self,
        key: &[Vec<u8>],
        chunks: &[&[u8]],
        input_len: usize,
        max_delta_size: u128,
        soft: Option<bool>,
    ) -> Result<(usize, usize, &'static str), Error>;

    /// Return the overall compression ratio.
fn ratio(&self) -> f32; /// Finish this group, creating a formatted stream. /// /// After calling this, the compressor should no longer be used fn flush(self) -> (Vec>, usize); /// Call this if you want to 'revoke' the last compression. /// /// After this, the data structures will be rolled back, but you cannot do more compression. fn flush_without_last(self) -> (Vec>, usize); } pub struct TraditionalGroupCompressor { delta_index: crate::groupcompress::line_delta::LinesDeltaIndex, endpoint: usize, input_bytes: usize, last: Option<(usize, usize)>, labels_deltas: HashMap>, (usize, usize, usize, usize)>, } impl GroupCompressor for TraditionalGroupCompressor { fn ratio(&self) -> f32 { self.input_bytes as f32 / self.endpoint as f32 } fn flush(self) -> (Vec>, usize) { (self.delta_index.lines().to_vec(), self.endpoint) } fn flush_without_last(self) -> (Vec>, usize) { let last = self.last.unwrap(); (self.delta_index.lines()[..last.0].to_vec(), last.1) } fn compress_block( &mut self, key: &[Vec], chunks: &[&[u8]], input_len: usize, max_delta_size: u128, soft: Option, ) -> Result<(usize, usize, &'static str), Error> { let new_lines = osutils::chunks_to_lines(chunks.iter().map(|x| Ok::<_, std::io::Error>(*x))) .collect::, _>>() .unwrap(); let (mut out_lines, mut index_lines) = self.delta_index .make_delta(new_lines.as_slice(), input_len, soft); let delta_length = out_lines.iter().map(|l| l.len() as u128).sum(); let (r#type, out_lines) = if delta_length > max_delta_size { // The delta is longer than the fulltext, insert a fulltext let mut out_lines = vec![Cow::Borrowed(&b"f"[..]), { let mut data = Vec::new(); write_base128_int(&mut data, input_len as u128).unwrap(); Cow::Owned(data) }]; index_lines.clear(); index_lines.extend(vec![false, false]); index_lines.extend([true].repeat(new_lines.len())); out_lines.extend(new_lines); ("fulltext", out_lines) } else { // this is a worthy delta, output it out_lines[0] = Cow::Borrowed(&b"d"[..]); // Update the delta_length to include 
those two encoded integers { let mut data = Vec::new(); write_base128_int(&mut data, delta_length).unwrap(); out_lines[1] = Cow::Owned(data); } ("delta", out_lines) }; // Before insertion let start = self.endpoint; let chunk_start = self.delta_index.lines().len(); self.last = Some((chunk_start, self.endpoint)); self.delta_index.extend_lines( out_lines .into_iter() .map(|x| x.into_owned()) .collect::>() .as_slice(), &index_lines, ); self.endpoint = self.delta_index.endpoint(); self.input_bytes += input_len; let chunk_end = self.delta_index.lines().len(); self.labels_deltas .insert(key.to_vec(), (start, chunk_start, self.endpoint, chunk_end)); Ok((start, self.endpoint, r#type)) } } impl Default for TraditionalGroupCompressor { fn default() -> Self { Self::new() } } impl TraditionalGroupCompressor { pub fn new() -> Self { Self { delta_index: crate::groupcompress::line_delta::LinesDeltaIndex::new(vec![]), endpoint: 0, input_bytes: 0, last: None, labels_deltas: HashMap::new(), } } pub fn chunks(&self) -> &[Vec] { self.delta_index.lines() } pub fn endpoint(&self) -> usize { self.endpoint } /// Extract a key previously added to the compressor. /// /// # Arguments /// * `key`: The key to extract. /// /// # Returns /// An iterable over chunks and the sha1. pub fn extract(&self, key: &Vec>) -> Result<(Vec>, String), String> { let (_start_byte, start_chunk, _end_byte, end_chunk) = self.labels_deltas.get(key).unwrap(); let delta_chunks = &self.delta_index.lines()[*start_chunk..*end_chunk]; let stored_bytes = delta_chunks.concat(); let data = match read_item(&mut stored_bytes.as_slice()).map_err(|e| e.to_string())? { GroupCompressItem::Fulltext(data) => vec![data], GroupCompressItem::Delta(data) => { let source = self.delta_index.lines()[..*start_chunk].concat(); vec![apply_delta(source.as_slice(), data.as_slice())?] 
} }; let data_sha1 = osutils::sha::sha_chunks(data.as_slice()); Ok((data, data_sha1)) } } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/delta.rs0000644000000000000000000004235515162074037021553 0ustar00use byteorder::{ReadBytesExt, WriteBytesExt}; use std::io::{Read, Write}; pub const MAX_INSERT_SIZE: usize = 0x7F; pub const MAX_COPY_SIZE: usize = 0x10000; #[deprecated] pub fn encode_base128_int(val: u128) -> Vec { let mut data = Vec::new(); write_base128_int(&mut data, val).unwrap(); data } /// Encode an integer using base128 encoding. pub fn write_base128_int(mut writer: W, val: u128) -> std::io::Result { let mut val = val; let mut length = 0; while val >= 0x80 { writer.write_all(&[((val | 0x80) & 0xFF) as u8])?; length += 1; val >>= 7; } writer.write_all(&[val as u8])?; Ok(length + 1) } /// Decode a base128 encoded integer. pub fn read_base128_int(reader: &mut R) -> Result { let mut val: u128 = 0; let mut shift = 0; let mut bval = [0]; reader.read_exact(&mut bval)?; while bval[0] >= 0x80 { val |= ((bval[0] & 0x7F) as u128) << shift; reader.read_exact(&mut bval)?; shift += 7; } val |= (bval[0] as u128) << shift; Ok(val) } #[cfg(test)] mod test_base128_int { #[test] fn test_decode_base128_int() { assert_eq!(super::decode_base128_int(&[0x00]), (0, 1)); assert_eq!(super::decode_base128_int(&[0x01]), (1, 1)); assert_eq!(super::decode_base128_int(&[0x7F]), (127, 1)); assert_eq!(super::decode_base128_int(&[0x80, 0x01]), (128, 2)); assert_eq!(super::decode_base128_int(&[0xFF, 0x01]), (255, 2)); assert_eq!(super::decode_base128_int(&[0x80, 0x02]), (256, 2)); assert_eq!(super::decode_base128_int(&[0x81, 0x02]), (257, 2)); assert_eq!(super::decode_base128_int(&[0x82, 0x02]), (258, 2)); assert_eq!(super::decode_base128_int(&[0xFF, 0x7F]), (16383, 2)); assert_eq!(super::decode_base128_int(&[0x80, 0x80, 0x01]), (16384, 3)); assert_eq!(super::decode_base128_int(&[0xFF, 0xFF, 0x7F]), (2097151, 3)); assert_eq!( super::decode_base128_int(&[0x80, 0x80, 0x80, 0x01]), 
(2097152, 4) ); assert_eq!( super::decode_base128_int(&[0xFF, 0xFF, 0xFF, 0x7F]), (268435455, 4) ); assert_eq!( super::decode_base128_int(&[0x80, 0x80, 0x80, 0x80, 0x01]), (268435456, 5) ); assert_eq!( super::decode_base128_int(&[0xFF, 0xFF, 0xFF, 0xFF, 0x7F]), (34359738367, 5) ); assert_eq!( super::decode_base128_int(&[0x80, 0x80, 0x80, 0x80, 0x80, 0x01]), (34359738368, 6) ); } #[test] fn test_encode_base128_int() { assert_eq!(super::encode_base128_int(0), [0x00]); assert_eq!(super::encode_base128_int(1), [0x01]); assert_eq!(super::encode_base128_int(127), [0x7F]); assert_eq!(super::encode_base128_int(128), [0x80, 0x01]); assert_eq!(super::encode_base128_int(255), [0xFF, 0x01]); assert_eq!(super::encode_base128_int(256), [0x80, 0x02]); assert_eq!(super::encode_base128_int(257), [0x81, 0x02]); assert_eq!(super::encode_base128_int(258), [0x82, 0x02]); assert_eq!(super::encode_base128_int(16383), [0xFF, 0x7F]); assert_eq!(super::encode_base128_int(16384), [0x80, 0x80, 0x01]); assert_eq!(super::encode_base128_int(2097151), [0xFF, 0xFF, 0x7F]); assert_eq!(super::encode_base128_int(2097152), [0x80, 0x80, 0x80, 0x01]); assert_eq!( super::encode_base128_int(268435455), [0xFF, 0xFF, 0xFF, 0x7F] ); assert_eq!( super::encode_base128_int(268435456), [0x80, 0x80, 0x80, 0x80, 0x01] ); assert_eq!( super::encode_base128_int(34359738367), [0xFF, 0xFF, 0xFF, 0xFF, 0x7F] ); assert_eq!( super::encode_base128_int(34359738368), [0x80, 0x80, 0x80, 0x80, 0x80, 0x01] ); assert_eq!( super::encode_base128_int(4398046511103), [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x7F] ); assert_eq!( super::encode_base128_int(4398046511104), [0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x01] ); } } #[deprecated] pub fn decode_base128_int(data: &[u8]) -> (u128, usize) { let mut cursor = std::io::Cursor::new(data); let val = read_base128_int(&mut cursor).unwrap(); (val, cursor.position() as usize) } #[deprecated] pub fn decode_copy_instruction( data: &[u8], cmd: u8, pos: usize, ) -> Result<(usize, usize, usize), String> { 
let mut c = std::io::Cursor::new(&data[pos..]); let (offset, length) = read_copy_instruction(&mut c, cmd).unwrap(); Ok((offset, length, pos + c.position() as usize)) } pub type CopyInstruction = (usize, usize); pub fn read_copy_instruction( reader: &mut R, cmd: u8, ) -> Result { if cmd & 0x80 != 0x80 { return Err(std::io::Error::new( std::io::ErrorKind::Other, "copy instructions must have bit 0x80 set".to_string(), )); } let mut offset = 0; let mut length = 0; if cmd & 0x01 != 0 { offset = reader.read_u8()? as usize; } if cmd & 0x02 != 0 { offset |= (reader.read_u8()? as usize) << 8; } if cmd & 0x04 != 0 { offset |= (reader.read_u8()? as usize) << 16; } if cmd & 0x08 != 0 { offset |= (reader.read_u8()? as usize) << 24; } if cmd & 0x10 != 0 { length = reader.read_u8()? as usize; } if cmd & 0x20 != 0 { length |= (reader.read_u8()? as usize) << 8; } if cmd & 0x40 != 0 { length |= (reader.read_u8()? as usize) << 16; } if length == 0 { length = 65536; } Ok((offset, length)) } pub fn apply_delta(basis: &[u8], mut delta: &[u8]) -> Result, String> { let target_length = read_base128_int(&mut delta).map_err(|e| e.to_string())?; let mut lines = Vec::new(); while !delta.is_empty() { let cmd = delta.read_u8().map_err(|e| e.to_string())?; if cmd & 0x80 != 0 { let (offset, length) = read_copy_instruction(&mut delta, cmd).map_err(|e| e.to_string())?; let last = offset + length; if last > basis.len() { return Err("data would copy bytes past the end of source".to_string()); } lines.extend_from_slice(&basis[offset..last]); } else { if cmd == 0 { return Err("Command == 0 not supported yet".to_string()); } lines.extend_from_slice(&delta[..cmd as usize]); delta = &delta[cmd as usize..]; } } if lines.len() != target_length as usize { return Err(format!( "Delta claimed to be {} long, but ended up {} long", target_length, lines.len() )); } Ok(lines) } #[cfg(test)] mod test_apply_delta { const TEXT1: &[u8] = b"This is a bit of source text which is meant to be matched against other text "; 
const TEXT2: &[u8] = b"This is a bit of source text which is meant to differ from against other text "; #[test] fn test_apply_delta() { let target = super::apply_delta(TEXT1, b"N\x90/\x1fdiffer from\nagainst other text\n").unwrap(); assert_eq!(target, TEXT2); let target = super::apply_delta(TEXT2, b"M\x90/\x1ebe matched\nagainst other text\n").unwrap(); assert_eq!(target, TEXT1); } } #[deprecated] pub fn apply_delta_to_source( source: &[u8], delta_start: usize, delta_end: usize, ) -> Result, String> { let source_size = source.len(); if delta_start >= source_size { return Err("delta starts after source".to_string()); } if delta_end > source_size { return Err("delta ends after source".to_string()); } if delta_start >= delta_end { return Err("delta starts after it ends".to_string()); } let delta_bytes = &source[delta_start..delta_end]; apply_delta(source, delta_bytes) } pub fn encode_copy_instruction(mut offset: usize, mut length: usize) -> Vec { let mut copy_bytes = vec![]; // Convert this offset into a control code and bytes. 
let mut copy_command: u8 = 0x80; for copy_bit in [0x01, 0x02, 0x04, 0x08].iter() { let base_byte = (offset & 0xff) as u8; if base_byte != 0 { copy_command |= *copy_bit; copy_bytes.push(base_byte); } offset >>= 8; } assert!( length <= MAX_COPY_SIZE, "we don't emit copy records for lengths > 64KiB" ); assert_ne!(length, 0, "we don't emit copy records for lengths == 0"); if length != 0x10000 { // A copy of length exactly 64*1024 == 0x10000 is sent as a length of 0, // since that saves bytes for large chained copies for copy_bit in [0x10, 0x20].iter() { let base_byte = (length & 0xff) as u8; if base_byte != 0 { copy_command |= *copy_bit; copy_bytes.push(base_byte); } length >>= 8; } } copy_bytes.insert(0, copy_command); copy_bytes } pub fn write_copy_instruction( mut writer: W, offset: usize, length: usize, ) -> Result { let data = encode_copy_instruction(offset, length); writer.write_all(data.as_slice())?; Ok(data.len()) } pub fn write_insert_instruction( mut writer: W, data: &[u8], ) -> Result { assert!(data.len() <= 0x7F); writer.write_u8(data.len() as u8)?; writer.write_all(data)?; Ok(data.len() + 1) } #[derive(Debug, PartialEq, Eq)] pub enum Instruction> { r#Copy { offset: usize, length: usize }, Insert(T), } pub fn write_instruction>( writer: W, instruction: &Instruction, ) -> std::io::Result { match instruction { Instruction::Copy { offset, length } => write_copy_instruction(writer, *offset, *length), Instruction::Insert(data) => write_insert_instruction(writer, data.borrow()), } } pub fn read_instruction(mut reader: R) -> Result>, std::io::Error> { let cmd = reader.read_u8()?; if cmd & 0x80 != 0 { let (offset, length) = read_copy_instruction(&mut reader, cmd)?; Ok(Instruction::Copy { offset, length }) } else if cmd == 0 { Err(std::io::Error::new( std::io::ErrorKind::InvalidData, "Command == 0 not supported yet", )) } else { let length = cmd as usize; let mut data = vec![0; length]; reader.read_exact(&mut data)?; Ok(Instruction::Insert(data)) } } /// Decode a 
copy instruction from the given data, starting at the given position. pub fn decode_instruction(data: &[u8], pos: usize) -> Result<(Instruction<&[u8]>, usize), String> { let cmd = data[pos]; if cmd & 0x80 != 0 { let mut c = std::io::Cursor::new(&data[pos + 1..]); let (offset, length) = read_copy_instruction(&mut c, cmd).map_err(|e| e.to_string())?; let newpos = pos + 1 + c.position() as usize; Ok((Instruction::Copy { offset, length }, newpos)) } else { let length = cmd as usize; let newpos = pos + 1 + length; if newpos > data.len() { return Err(format!( "Instruction length {} at position {} extends past end of data", length, pos )); } Ok((Instruction::Insert(&data[pos + 1..newpos]), newpos)) } } #[cfg(test)] mod test_copy_instruction { fn assert_encode(expected: &[u8], offset: usize, length: usize) { let data = super::encode_copy_instruction(offset, length); assert_eq!(expected, data); } fn assert_decode( exp_offset: usize, exp_length: usize, exp_newpos: usize, data: &[u8], mut pos: usize, ) { let cmd = data[pos]; pos += 1; let out = super::decode_copy_instruction(data, cmd, pos).unwrap(); assert_eq!((exp_offset, exp_length, exp_newpos), out); } #[test] fn test_encode_no_length() { assert_encode(b"\x80", 0, 64 * 1024); assert_encode(b"\x81\x01", 1, 64 * 1024); assert_encode(b"\x81\x0a", 10, 64 * 1024); assert_encode(b"\x81\xff", 255, 64 * 1024); assert_encode(b"\x82\x01", 256, 64 * 1024); assert_encode(b"\x83\x01\x01", 257, 64 * 1024); assert_encode(b"\x8F\xff\xff\xff\xff", 0xFFFFFFFF, 64 * 1024); assert_encode(b"\x8E\xff\xff\xff", 0xFFFFFF00, 64 * 1024); assert_encode(b"\x8D\xff\xff\xff", 0xFFFF00FF, 64 * 1024); assert_encode(b"\x8B\xff\xff\xff", 0xFF00FFFF, 64 * 1024); assert_encode(b"\x87\xff\xff\xff", 0x00FFFFFF, 64 * 1024); assert_encode(b"\x8F\x04\x03\x02\x01", 0x01020304, 64 * 1024); } #[test] fn test_encode_no_offset() { assert_encode(b"\x90\x01", 0, 1); assert_encode(b"\x90\x0a", 0, 10); assert_encode(b"\x90\xff", 0, 255); assert_encode(b"\xA0\x01", 0, 
256); assert_encode(b"\xB0\x01\x01", 0, 257); assert_encode(b"\xB0\xff\xff", 0, 0xFFFF); // Special case, if copy == 64KiB, then we store exactly 0 // Note that this puns with a copy of exactly 0 bytes, but we don't care // about that, as we would never actually copy 0 bytes assert_encode(b"\x80", 0, 64 * 1024) } #[test] fn test_encode() { assert_encode(b"\x91\x01\x01", 1, 1); assert_encode(b"\x91\x09\x0a", 9, 10); assert_encode(b"\x91\xfe\xff", 254, 255); assert_encode(b"\xA2\x02\x01", 512, 256); assert_encode(b"\xB3\x02\x01\x01\x01", 258, 257); assert_encode(b"\xB0\x01\x01", 0, 257); // Special case, if copy == 64KiB, then we store exactly 0 // Note that this puns with a copy of exactly 0 bytes, but we don't care // about that, as we would never actually copy 0 bytes assert_encode(b"\x81\x0a", 10, 64 * 1024); } #[test] fn test_decode_no_length() { // If length is 0, it is interpreted as 64KiB // The shortest possible instruction is a copy of 64KiB from offset 0 assert_decode(0, 65536, 1, b"\x80", 0); assert_decode(1, 65536, 2, b"\x81\x01", 0); assert_decode(10, 65536, 2, b"\x81\x0a", 0); assert_decode(255, 65536, 2, b"\x81\xff", 0); assert_decode(256, 65536, 2, b"\x82\x01", 0); assert_decode(257, 65536, 3, b"\x83\x01\x01", 0); assert_decode(0xFFFFFFFF, 65536, 5, b"\x8F\xff\xff\xff\xff", 0); assert_decode(0xFFFFFF00, 65536, 4, b"\x8E\xff\xff\xff", 0); assert_decode(0xFFFF00FF, 65536, 4, b"\x8D\xff\xff\xff", 0); assert_decode(0xFF00FFFF, 65536, 4, b"\x8B\xff\xff\xff", 0); assert_decode(0x00FFFFFF, 65536, 4, b"\x87\xff\xff\xff", 0); assert_decode(0x01020304, 65536, 5, b"\x8F\x04\x03\x02\x01", 0); } #[test] fn test_decode_no_offset() { assert_decode(0, 1, 2, b"\x90\x01", 0); assert_decode(0, 10, 2, b"\x90\x0a", 0); assert_decode(0, 255, 2, b"\x90\xff", 0); assert_decode(0, 256, 2, b"\xA0\x01", 0); assert_decode(0, 257, 3, b"\xB0\x01\x01", 0); assert_decode(0, 65535, 3, b"\xB0\xff\xff", 0); // Special case, if copy == 64KiB, then we store exactly 0 // Note that this 
puns with a copy of exactly 0 bytes, but we don't care // about that, as we would never actually copy 0 bytes assert_decode(0, 65536, 1, b"\x80", 0); } #[test] fn test_decode() { assert_decode(1, 1, 3, b"\x91\x01\x01", 0); assert_decode(9, 10, 3, b"\x91\x09\x0a", 0); assert_decode(254, 255, 3, b"\x91\xfe\xff", 0); assert_decode(512, 256, 3, b"\xA2\x02\x01", 0); assert_decode(258, 257, 5, b"\xB3\x02\x01\x01\x01", 0); assert_decode(0, 257, 3, b"\xB0\x01\x01", 0); } #[test] fn test_decode_not_start() { assert_decode(1, 1, 6, b"abc\x91\x01\x01def", 3); assert_decode(9, 10, 5, b"ab\x91\x09\x0ade", 2); assert_decode(254, 255, 6, b"not\x91\xfe\xffcopy", 3); } } #[cfg(test)] mod test_instruction { use super::{decode_instruction, Instruction}; #[test] fn test_decode_copy_instruction() { assert_eq!( Ok(( Instruction::Copy { offset: 0, length: 65536 }, 1 )), decode_instruction(&b"\x80"[..], 0) ); assert_eq!( Ok(( Instruction::Copy { offset: 10, length: 65536 }, 2 )), decode_instruction(&b"\x81\x0a"[..], 0) ); } #[test] fn test_decode_insert_instruction() { assert_eq!( Ok((Instruction::Insert(&b"\x00"[..]), 2)), decode_instruction(&b"\x01\x00"[..], 0) ); assert_eq!( Ok((Instruction::Insert(&b"\x01"[..]), 2)), decode_instruction(&b"\x01\x01"[..], 0) ); assert_eq!( Ok((Instruction::Insert(&b"\xff\x05"[..]), 3)), decode_instruction(&b"\x02\xff\x05"[..], 0) ); } } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/line_delta.rs0000644000000000000000000003410615162074037022555 0ustar00use crate::groupcompress::delta::{encode_copy_instruction, write_base128_int}; use std::borrow::Cow; pub struct OutputHandler<'a> { out_lines: Vec>, index_lines: Vec, min_len_to_index: usize, cur_insert_lines: Vec>, cur_insert_len: usize, } impl<'a> OutputHandler<'a> { pub fn new( out_lines: Vec>, index_lines: Vec, min_len_to_index: usize, ) -> Self { OutputHandler { out_lines, index_lines, min_len_to_index, cur_insert_lines: Vec::new(), cur_insert_len: 0, } } pub fn add_copy(&mut self, start_byte: 
usize, end_byte: usize) {
        // The data stream allows >64kB in a copy, but to match the compiled
        // code, we will also limit it to a 64kB copy
        for start in (start_byte..end_byte).step_by(64 * 1024) {
            let num_bytes = (end_byte - start).min(64 * 1024);
            let copy_bytes = encode_copy_instruction(start, num_bytes);
            self.out_lines.push(Cow::Owned(copy_bytes));
            self.index_lines.push(false);
        }
    }

    fn flush_insert(&mut self) {
        if self.cur_insert_lines.is_empty() {
            return;
        }
        if self.cur_insert_len > 0x7f {
            panic!("We cannot insert more than 127 bytes at a time.");
        }
        self.out_lines
            .push(Cow::Owned(vec![self.cur_insert_len as u8]));
        self.index_lines.push(false);
        self.out_lines
            .extend_from_slice(self.cur_insert_lines.as_slice());
        self.index_lines.extend(vec![
            self.cur_insert_len >= self.min_len_to_index;
            self.cur_insert_lines.len()
        ]);
        self.cur_insert_lines.clear();
        self.cur_insert_len = 0;
    }

    fn insert_long_line(&mut self, line: Cow<'a, [u8]>) {
        // Flush out anything pending
        self.flush_insert();
        let line_len = line.len();
        for start_index in (0..line_len).step_by(0x7f) {
            let next_len = (line_len - start_index).min(0x7f);
            self.out_lines.push(Cow::Owned(vec![next_len as u8]));
            self.index_lines.push(false);
            // TODO(mem): This should ideally be Cow::Borrowed:
            self.out_lines.push(Cow::Owned(
                line.as_ref()[start_index..start_index + next_len].to_vec(),
            ));
            // We don't index long lines, because we won't be able to match
            // a line split across multiple inserts anyway
            self.index_lines.push(false);
        }
    }

    pub fn add_insert(&mut self, lines: impl Iterator<Item = Cow<'a, [u8]>>) {
        if !self.cur_insert_lines.is_empty() {
            panic!("self.cur_insert_lines must be empty when adding a new insert");
        }
        for line in lines {
            if line.len() > 0x7f {
                self.insert_long_line(line);
            } else {
                let next_len = line.len() + self.cur_insert_len;
                if next_len > 0x7f {
                    // Adding this line would overflow, so flush, and start over
                    self.flush_insert();
                    self.cur_insert_len = line.len();
                    self.cur_insert_lines = vec![line];
                } else {
                    self.cur_insert_lines.push(line);
                    self.cur_insert_len = next_len;
                }
            }
        }
        self.flush_insert();
    }
}

use std::collections::{HashMap, HashSet};

/// This class indexes matches between strings.
///
/// # Attributes
/// * `lines`: The 'static' lines that will be preserved between runs.
/// * `matching_lines`: A dict of {line:[matching offsets]}
/// * `line_offsets`: The byte offset for the end of each line, used to quickly map between a
///   matching line number and the byte location
/// * `endpoint`: The total number of bytes across all lines (the last entry of `line_offsets`)
pub struct LinesDeltaIndex {
    lines: Vec<Vec<u8>>,
    line_offsets: Vec<usize>,
    endpoint: usize,
    matching_lines: HashMap<Vec<u8>, HashSet<usize>>,
}

impl LinesDeltaIndex {
    const MIN_MATCH_BYTES: usize = 10;
    const SOFT_MIN_MATCH_BYTES: usize = 200;

    pub fn new(lines: Vec<Vec<u8>>) -> Self {
        let mut delta_index = LinesDeltaIndex {
            lines: vec![],
            line_offsets: vec![],
            endpoint: 0,
            matching_lines: HashMap::new(),
        };
        let index = vec![true; lines.len()];
        delta_index.extend_lines(lines.as_slice(), index.as_slice());
        delta_index
    }

    pub fn lines(&self) -> &[Vec<u8>] {
        self.lines.as_slice()
    }

    fn update_matching_lines(&mut self, new_lines: &[Vec<u8>], index: &[bool]) {
        let matches = &mut self.matching_lines;
        let start_idx = self.lines.len();
        if new_lines.len() != index.len() {
            panic!(
                "The number of lines to be indexed does not match the index/don't index flags: {} != {}",
                new_lines.len(),
                index.len()
            );
        }
        for (idx, (line, &do_index)) in std::iter::zip(new_lines, index).enumerate() {
            if !do_index {
                continue;
            }
            matches
                .entry(line.clone())
                .or_default()
                .insert(start_idx + idx);
        }
    }

    /// Return the indices of the indexed lines that match `line`.
    pub fn get_matches(&self, line: &[u8]) -> Option<&HashSet<usize>> {
        self.matching_lines.get(line)
    }

    /// Look at all matches for the current line, return the longest.
    ///
    /// # Arguments
    ///
    /// * `lines`: The lines we are matching against
    /// * `pos`: The current location we care about
    /// * `locations`: A list of lines that matched the current location.
    /// This may be None, but often we'll have already found matches for
    /// this line.
    ///
    /// # Returns
    /// (start_in_self, start_in_lines, num_lines)
    /// All values are the offset in the list (aka the line number)
    /// If start_in_self is None, then we have no matches, and this line
    /// should be inserted in the target.
    fn get_longest_match(
        &self,
        lines: &[Cow<'_, [u8]>],
        mut pos: usize,
    ) -> (Option<(usize, usize, usize)>, usize) {
        let range_start = pos;
        let mut range_len = 0;
        let mut prev_locations: Option<HashSet<usize>> = None;
        let max_pos = lines.len();
        while pos < max_pos {
            match self.matching_lines.get(lines[pos].as_ref()) {
                Some(locations) => {
                    // We have a match
                    if let Some(prev) = prev_locations.as_ref() {
                        // We have a match started, compare to see if any of the current matches
                        // can be continued.
                        let next_locations: HashSet<usize> = locations
                            .intersection(
                                &prev.iter().map(|&loc| loc + 1).collect::<HashSet<_>>(),
                            )
                            .cloned()
                            .collect();
                        if !next_locations.is_empty() {
                            // At least one of the regions continues to match
                            prev_locations = Some(next_locations);
                            range_len += 1;
                        } else {
                            // All the current regions no longer match.
                            // This line does still match something, just not at the end of the
                            // previous matches. We will return the location so that we can avoid
                            // another _matching_lines lookup.
                            break;
                        }
                    } else {
                        // This is the first match in a range
                        prev_locations = Some(locations.clone());
                        range_len = 1;
                    }
                    pos += 1;
                }
                None => {
                    // No more matches, just return whatever we have, but we know that this last
                    // position is not going to match anything.
                    pos += 1;
                    break;
                }
            }
        }
        if let Some(prev) = prev_locations {
            let smallest = *prev.iter().min().unwrap();
            (
                Some((smallest + 1 - range_len, range_start, range_len)),
                pos,
            )
        } else {
            (None, pos)
        }
    }

    /// Return the ranges in lines which match self.lines.
    ///
    /// # Arguments
    /// * `lines`: The lines to compress
    /// * `soft`: If true, require larger matching ranges (`SOFT_MIN_MATCH_BYTES` instead of
    ///   `MIN_MATCH_BYTES`) before a match is kept as a copy
    ///
    /// # Returns
    /// A list of (old_start, new_start, length) tuples which reflect
    /// a region in self.lines that is present in lines. The last element
    /// of the list is always (old_len, new_len, 0) to provide an end point
    /// for generating instructions from the matching blocks list.
    fn get_matching_blocks(
        &self,
        lines: &[Cow<'_, [u8]>],
        soft: bool,
    ) -> Vec<(usize, usize, usize)> {
        // In this code, we iterate over multiple get_longest_match calls, to
        // find the next longest copy, and possible insert regions. We then
        // convert that to the simple matching_blocks representation, since
        // otherwise inserting 10 lines in a row would show up as 10
        // instructions.
        let mut result = Vec::new();
        let mut pos = 0;
        let max_pos = lines.len();
        let min_match_bytes = if soft {
            Self::SOFT_MIN_MATCH_BYTES
        } else {
            Self::MIN_MATCH_BYTES
        };
        while pos < max_pos {
            let (block, new_pos) = self.get_longest_match(lines, pos);
            if let Some(block) = block {
                // Check to see if we match fewer than min_match_bytes. If so,
                // we will turn this into a pure 'insert', rather than a copy.
                // block.2 is the number of lines. A quick check says if we
                // have more lines than min_match_bytes, then we know we have
                // enough bytes.
                if block.2 < min_match_bytes {
                    // This block may be a 'short' block, check
                    let (_old_start, new_start, range_len) = block;
                    let matched_bytes: usize = lines[new_start..new_start + range_len]
                        .iter()
                        .map(|line| line.len())
                        .sum();
                    if matched_bytes >= min_match_bytes {
                        result.push(block);
                    }
                } else {
                    result.push(block);
                }
            }
            pos = new_pos;
        }
        result.push((self.lines.len(), lines.len(), 0));
        result
    }

    /// Add more lines to the left-lines list.
    ///
    /// # Arguments
    /// * `lines`: The lines to add.
    /// * `index`: A list of booleans indicating whether each line should be indexed.
pub fn extend_lines(&mut self, lines: &[Vec], index: &[bool]) { self.update_matching_lines(lines, index); self.lines.extend_from_slice(lines); let mut endpoint = self.endpoint; for line in lines { endpoint += std::convert::Into::>::into(line).len(); self.line_offsets.push(endpoint); } assert_eq!( self.line_offsets.len(), self.lines.len(), "Somehow the line offset indicator got out of sync with the line counter" ); self.endpoint = endpoint; } pub fn endpoint(&self) -> usize { self.endpoint } /// Compute the delta for this content versus the original content. pub fn make_delta<'a>( &self, new_lines: &'_ [Cow<'a, [u8]>], bytes_length: usize, soft: Option, ) -> (Vec>, Vec) { let soft = soft.unwrap_or(false); let out_lines = vec![ // reserved for content type, content length Cow::Owned(vec![]), Cow::Owned(vec![]), { let mut data = Vec::new(); write_base128_int(&mut data, bytes_length as u128).unwrap(); Cow::Owned(data) }, ]; let index_lines = vec![false, false, false]; let mut output_handler = OutputHandler::new(out_lines, index_lines, Self::MIN_MATCH_BYTES); let blocks = self.get_matching_blocks(new_lines, soft); let mut current_line_num = 0; // We either copy a range (while there are reusable lines) or we // insert new lines. To find reusable lines we traverse for (old_start, new_start, range_len) in blocks { if new_start != current_line_num { // non-matching region, insert the content output_handler.add_insert(new_lines[current_line_num..new_start].iter().cloned()); } current_line_num = new_start + range_len; if range_len > 0 { // Convert the line based offsets into byte based offsets let first_byte = if old_start == 0 { 0 } else { self.line_offsets[old_start - 1] }; let last_byte = self.line_offsets[old_start + range_len - 1]; output_handler.add_copy(first_byte, last_byte); } } (output_handler.out_lines, output_handler.index_lines) } } /// Create a delta from source to target. 
pub fn make_delta<'a>( source_bytes: &[u8], target_bytes: &'a [u8], ) -> impl Iterator> { // TODO(perf): Use Cow<[u8]> for the source lines let line_locations = LinesDeltaIndex::new( osutils::split_lines(source_bytes) .map(|x| x.into_owned()) .collect::>(), ); let lines = osutils::split_lines(target_bytes).collect::>(); line_locations .make_delta(lines.as_slice(), target_bytes.len(), None) .0 .into_iter() } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/mod.rs0000644000000000000000000000037615162074037021236 0ustar00pub mod block; pub mod compressor; pub mod delta; pub mod line_delta; pub mod rabin_delta; use sha1::{Digest as _, Sha1}; lazy_static::lazy_static! { pub static ref NULL_SHA1: Vec = format!("{:x}", Sha1::new().finalize()).as_bytes().to_vec(); } bzrformats_3.4.0.orig/crates/bazaar/src/groupcompress/rabin_delta.rs0000644000000000000000000005164115162074037022724 0ustar00use crate::groupcompress::delta::{ decode_instruction, read_base128_int, write_base128_int, write_instruction, Instruction, MAX_COPY_SIZE, MAX_INSERT_SIZE, }; use std::collections::HashMap; use std::convert::TryInto; use std::io::Write; /// diff-delta.rs: generate a delta between two buffers /// /// This code was greatly inspired by parts of LibXDiff from Davide Libenzi /// http://www.xmailserver.org/xdiff-lib.html /// /// Rewritten for GIT by Nicolas Pitre , (C) 2005-2007 /// Adapted for Bazaar by John Arbash Meinel (C) 2009 /// /// Ported to Rust by Jelmer Vernooij and significantly rewritten. /// /// This program is free software; you can redistribute it and/or modify /// it under the terms of the GNU General Public License as published by /// the Free Software Foundation; either version 2 of the License, or /// (at your option) any later version. /// /// NB: The version in GIT is 'version 2 of the Licence only', however Nicolas /// has granted permission for use under 'version 2 or later' in private email /// to Robert Collins and Karl Fogel on the 6th April 2009. 
// maximum hash entry list for the same hash bucket const RABIN_SHIFT: usize = 23; const RABIN_WINDOW: usize = 16; const T: &[u32; 256] = &[ 0x00000000, 0xab59b4d1, 0x56b369a2, 0xfdeadd73, 0x063f6795, 0xad66d344, 0x508c0e37, 0xfbd5bae6, 0x0c7ecf2a, 0xa7277bfb, 0x5acda688, 0xf1941259, 0x0a41a8bf, 0xa1181c6e, 0x5cf2c11d, 0xf7ab75cc, 0x18fd9e54, 0xb3a42a85, 0x4e4ef7f6, 0xe5174327, 0x1ec2f9c1, 0xb59b4d10, 0x48719063, 0xe32824b2, 0x1483517e, 0xbfdae5af, 0x423038dc, 0xe9698c0d, 0x12bc36eb, 0xb9e5823a, 0x440f5f49, 0xef56eb98, 0x31fb3ca8, 0x9aa28879, 0x6748550a, 0xcc11e1db, 0x37c45b3d, 0x9c9defec, 0x6177329f, 0xca2e864e, 0x3d85f382, 0x96dc4753, 0x6b369a20, 0xc06f2ef1, 0x3bba9417, 0x90e320c6, 0x6d09fdb5, 0xc6504964, 0x2906a2fc, 0x825f162d, 0x7fb5cb5e, 0xd4ec7f8f, 0x2f39c569, 0x846071b8, 0x798aaccb, 0xd2d3181a, 0x25786dd6, 0x8e21d907, 0x73cb0474, 0xd892b0a5, 0x23470a43, 0x881ebe92, 0x75f463e1, 0xdeadd730, 0x63f67950, 0xc8afcd81, 0x354510f2, 0x9e1ca423, 0x65c91ec5, 0xce90aa14, 0x337a7767, 0x9823c3b6, 0x6f88b67a, 0xc4d102ab, 0x393bdfd8, 0x92626b09, 0x69b7d1ef, 0xc2ee653e, 0x3f04b84d, 0x945d0c9c, 0x7b0be704, 0xd05253d5, 0x2db88ea6, 0x86e13a77, 0x7d348091, 0xd66d3440, 0x2b87e933, 0x80de5de2, 0x7775282e, 0xdc2c9cff, 0x21c6418c, 0x8a9ff55d, 0x714a4fbb, 0xda13fb6a, 0x27f92619, 0x8ca092c8, 0x520d45f8, 0xf954f129, 0x04be2c5a, 0xafe7988b, 0x5432226d, 0xff6b96bc, 0x02814bcf, 0xa9d8ff1e, 0x5e738ad2, 0xf52a3e03, 0x08c0e370, 0xa39957a1, 0x584ced47, 0xf3155996, 0x0eff84e5, 0xa5a63034, 0x4af0dbac, 0xe1a96f7d, 0x1c43b20e, 0xb71a06df, 0x4ccfbc39, 0xe79608e8, 0x1a7cd59b, 0xb125614a, 0x468e1486, 0xedd7a057, 0x103d7d24, 0xbb64c9f5, 0x40b17313, 0xebe8c7c2, 0x16021ab1, 0xbd5bae60, 0x6cb54671, 0xc7ecf2a0, 0x3a062fd3, 0x915f9b02, 0x6a8a21e4, 0xc1d39535, 0x3c394846, 0x9760fc97, 0x60cb895b, 0xcb923d8a, 0x3678e0f9, 0x9d215428, 0x66f4eece, 0xcdad5a1f, 0x3047876c, 0x9b1e33bd, 0x7448d825, 0xdf116cf4, 0x22fbb187, 0x89a20556, 0x7277bfb0, 0xd92e0b61, 0x24c4d612, 0x8f9d62c3, 0x7836170f, 0xd36fa3de, 
0x2e857ead, 0x85dcca7c, 0x7e09709a, 0xd550c44b, 0x28ba1938, 0x83e3ade9, 0x5d4e7ad9, 0xf617ce08, 0x0bfd137b, 0xa0a4a7aa, 0x5b711d4c, 0xf028a99d, 0x0dc274ee, 0xa69bc03f, 0x5130b5f3, 0xfa690122, 0x0783dc51, 0xacda6880, 0x570fd266, 0xfc5666b7, 0x01bcbbc4, 0xaae50f15, 0x45b3e48d, 0xeeea505c, 0x13008d2f, 0xb85939fe, 0x438c8318, 0xe8d537c9, 0x153feaba, 0xbe665e6b, 0x49cd2ba7, 0xe2949f76, 0x1f7e4205, 0xb427f6d4, 0x4ff24c32, 0xe4abf8e3, 0x19412590, 0xb2189141, 0x0f433f21, 0xa41a8bf0, 0x59f05683, 0xf2a9e252, 0x097c58b4, 0xa225ec65, 0x5fcf3116, 0xf49685c7, 0x033df00b, 0xa86444da, 0x558e99a9, 0xfed72d78, 0x0502979e, 0xae5b234f, 0x53b1fe3c, 0xf8e84aed, 0x17bea175, 0xbce715a4, 0x410dc8d7, 0xea547c06, 0x1181c6e0, 0xbad87231, 0x4732af42, 0xec6b1b93, 0x1bc06e5f, 0xb099da8e, 0x4d7307fd, 0xe62ab32c, 0x1dff09ca, 0xb6a6bd1b, 0x4b4c6068, 0xe015d4b9, 0x3eb80389, 0x95e1b758, 0x680b6a2b, 0xc352defa, 0x3887641c, 0x93ded0cd, 0x6e340dbe, 0xc56db96f, 0x32c6cca3, 0x999f7872, 0x6475a501, 0xcf2c11d0, 0x34f9ab36, 0x9fa01fe7, 0x624ac294, 0xc9137645, 0x26459ddd, 0x8d1c290c, 0x70f6f47f, 0xdbaf40ae, 0x207afa48, 0x8b234e99, 0x76c993ea, 0xdd90273b, 0x2a3b52f7, 0x8162e626, 0x7c883b55, 0xd7d18f84, 0x2c043562, 0x875d81b3, 0x7ab75cc0, 0xd1eee811, ]; const U: &[u32; 256] = &[ 0x00000000, 0x7eb5200d, 0x5633f4cb, 0x2886d4c6, 0x073e5d47, 0x798b7d4a, 0x510da98c, 0x2fb88981, 0x0e7cba8e, 0x70c99a83, 0x584f4e45, 0x26fa6e48, 0x0942e7c9, 0x77f7c7c4, 0x5f711302, 0x21c4330f, 0x1cf9751c, 0x624c5511, 0x4aca81d7, 0x347fa1da, 0x1bc7285b, 0x65720856, 0x4df4dc90, 0x3341fc9d, 0x1285cf92, 0x6c30ef9f, 0x44b63b59, 0x3a031b54, 0x15bb92d5, 0x6b0eb2d8, 0x4388661e, 0x3d3d4613, 0x39f2ea38, 0x4747ca35, 0x6fc11ef3, 0x11743efe, 0x3eccb77f, 0x40799772, 0x68ff43b4, 0x164a63b9, 0x378e50b6, 0x493b70bb, 0x61bda47d, 0x1f088470, 0x30b00df1, 0x4e052dfc, 0x6683f93a, 0x1836d937, 0x250b9f24, 0x5bbebf29, 0x73386bef, 0x0d8d4be2, 0x2235c263, 0x5c80e26e, 0x740636a8, 0x0ab316a5, 0x2b7725aa, 0x55c205a7, 0x7d44d161, 0x03f1f16c, 0x2c4978ed, 0x52fc58e0, 
0x7a7a8c26, 0x04cfac2b, 0x73e5d470, 0x0d50f47d, 0x25d620bb, 0x5b6300b6, 0x74db8937, 0x0a6ea93a, 0x22e87dfc, 0x5c5d5df1, 0x7d996efe, 0x032c4ef3, 0x2baa9a35, 0x551fba38, 0x7aa733b9, 0x041213b4, 0x2c94c772, 0x5221e77f, 0x6f1ca16c, 0x11a98161, 0x392f55a7, 0x479a75aa, 0x6822fc2b, 0x1697dc26, 0x3e1108e0, 0x40a428ed, 0x61601be2, 0x1fd53bef, 0x3753ef29, 0x49e6cf24, 0x665e46a5, 0x18eb66a8, 0x306db26e, 0x4ed89263, 0x4a173e48, 0x34a21e45, 0x1c24ca83, 0x6291ea8e, 0x4d29630f, 0x339c4302, 0x1b1a97c4, 0x65afb7c9, 0x446b84c6, 0x3adea4cb, 0x1258700d, 0x6ced5000, 0x4355d981, 0x3de0f98c, 0x15662d4a, 0x6bd30d47, 0x56ee4b54, 0x285b6b59, 0x00ddbf9f, 0x7e689f92, 0x51d01613, 0x2f65361e, 0x07e3e2d8, 0x7956c2d5, 0x5892f1da, 0x2627d1d7, 0x0ea10511, 0x7014251c, 0x5facac9d, 0x21198c90, 0x099f5856, 0x772a785b, 0x4c921c31, 0x32273c3c, 0x1aa1e8fa, 0x6414c8f7, 0x4bac4176, 0x3519617b, 0x1d9fb5bd, 0x632a95b0, 0x42eea6bf, 0x3c5b86b2, 0x14dd5274, 0x6a687279, 0x45d0fbf8, 0x3b65dbf5, 0x13e30f33, 0x6d562f3e, 0x506b692d, 0x2ede4920, 0x06589de6, 0x78edbdeb, 0x5755346a, 0x29e01467, 0x0166c0a1, 0x7fd3e0ac, 0x5e17d3a3, 0x20a2f3ae, 0x08242768, 0x76910765, 0x59298ee4, 0x279caee9, 0x0f1a7a2f, 0x71af5a22, 0x7560f609, 0x0bd5d604, 0x235302c2, 0x5de622cf, 0x725eab4e, 0x0ceb8b43, 0x246d5f85, 0x5ad87f88, 0x7b1c4c87, 0x05a96c8a, 0x2d2fb84c, 0x539a9841, 0x7c2211c0, 0x029731cd, 0x2a11e50b, 0x54a4c506, 0x69998315, 0x172ca318, 0x3faa77de, 0x411f57d3, 0x6ea7de52, 0x1012fe5f, 0x38942a99, 0x46210a94, 0x67e5399b, 0x19501996, 0x31d6cd50, 0x4f63ed5d, 0x60db64dc, 0x1e6e44d1, 0x36e89017, 0x485db01a, 0x3f77c841, 0x41c2e84c, 0x69443c8a, 0x17f11c87, 0x38499506, 0x46fcb50b, 0x6e7a61cd, 0x10cf41c0, 0x310b72cf, 0x4fbe52c2, 0x67388604, 0x198da609, 0x36352f88, 0x48800f85, 0x6006db43, 0x1eb3fb4e, 0x238ebd5d, 0x5d3b9d50, 0x75bd4996, 0x0b08699b, 0x24b0e01a, 0x5a05c017, 0x728314d1, 0x0c3634dc, 0x2df207d3, 0x534727de, 0x7bc1f318, 0x0574d315, 0x2acc5a94, 0x54797a99, 0x7cffae5f, 0x024a8e52, 0x06852279, 0x78300274, 0x50b6d6b2, 0x2e03f6bf, 
0x01bb7f3e, 0x7f0e5f33, 0x57888bf5, 0x293dabf8, 0x08f998f7, 0x764cb8fa, 0x5eca6c3c, 0x207f4c31, 0x0fc7c5b0, 0x7172e5bd, 0x59f4317b, 0x27411176, 0x1a7c5765, 0x64c97768, 0x4c4fa3ae, 0x32fa83a3, 0x1d420a22, 0x63f72a2f, 0x4b71fee9, 0x35c4dee4, 0x1400edeb, 0x6ab5cde6, 0x42331920, 0x3c86392d, 0x133eb0ac, 0x6d8b90a1, 0x450d4467, 0x3bb8646a, ]; // Result type for functions that have multiple failure modes #[derive(Debug)] pub enum DeltaError { Io(std::io::Error), // An IO error occurred DeltaTooLarge, // The delta is too large to be encoded } impl std::fmt::Display for DeltaError { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { match self { DeltaError::Io(err) => write!(f, "IO error: {}", err), DeltaError::DeltaTooLarge => write!(f, "Delta too large"), } } } impl From for DeltaError { fn from(err: std::io::Error) -> DeltaError { DeltaError::Io(err) } } #[derive(Debug, Clone, Copy, PartialEq, Eq)] pub struct RabinHash(u32); impl RabinHash { pub fn pushright(&mut self, c: u8) { self.0 = ((self.0 << 8) | c as u32) ^ T[(self.0 >> RABIN_SHIFT) as usize]; } pub fn popleft(&mut self, c: u8) { self.0 ^= U[c as usize]; } pub fn finish(&self) -> u32 { self.0 } } impl From for u32 { fn from(val: RabinHash) -> u32 { val.0 } } pub fn rabin_hash(data: [u8; RABIN_WINDOW]) -> RabinHash { assert_eq!(data.len(), RABIN_WINDOW); let mut val = RabinHash(0); for c in data.iter().take(RABIN_WINDOW) { val.pushright(*c); } val } pub struct RabinWindow { data: [u8; RABIN_WINDOW], pos: usize, hash: RabinHash, } impl RabinWindow { pub fn new(data: [u8; RABIN_WINDOW]) -> Self { let hash = rabin_hash(data); RabinWindow { data, hash, pos: 0 } } pub fn push(&mut self, c: u8) { self.hash.pushright(c); self.hash.popleft(self.data[self.pos]); self.data[self.pos] = c; self.pos = (self.pos + 1) % RABIN_WINDOW; } pub fn hash(&self) -> RabinHash { self.hash } } #[derive(Debug, Clone)] pub struct DeltaIndex<'a> { entries: HashMap>>, last_offset: usize, } #[derive(Debug, Clone, Copy, Default)] 
pub struct IndexEntry<'a> { /// Absolute offset pub offset: usize, pub data: &'a [u8], } impl IndexEntry<'_> { pub fn add(&self, offset: usize) -> Self { Self { offset: self.offset + offset, data: &self.data[offset..], } } } impl Default for DeltaIndex<'_> { fn default() -> Self { Self::new() } } impl<'a> DeltaIndex<'a> { pub fn iter_matches(&self, val: &RabinHash) -> impl Iterator> + '_ { self.entries .get(&val.finish()) .into_iter() .flat_map(|v| v.iter()) } fn find_match( &self, hash: RabinHash, data: &[u8], mut min_size: usize, good_enough_size: Option, ) -> Option<(IndexEntry<'a>, usize)> { let mut msource = None; for entry in self.iter_matches(&hash) { if entry.data.len() <= min_size { // no point in checking this one continue; } let overlap = entry .data .iter() .zip(data.iter()) .take_while(|(x, y)| x == y) .count(); if overlap > min_size { /* this is our best match so far */ min_size = overlap; msource = Some(*entry); if let Some(good_enough_size) = good_enough_size { if min_size >= good_enough_size { /* good enough */ return Some((msource.unwrap(), min_size)); } } } } msource.map(|s| (s, min_size)) } pub fn new() -> Self { Self { entries: HashMap::new(), last_offset: 0, } } pub fn add_delta(&mut self, mut delta: &'a [u8], unused_bytes: usize) -> std::io::Result<()> { read_base128_int(&mut delta)?; let mut pos = 0; while !delta.is_empty() { pos = match decode_instruction(&delta[pos..], 0) .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))? { (Instruction::Copy { .. }, pos) => pos, (Instruction::Insert(data), pos) => { // The create_delta code requires a match at least 4 characters // (including only the last char of the RABIN_WINDOW) before it // will consider it something worth copying rather than inserting. // So we don't want to index anything that we know won't ever be a // match. 
for i in 0..data.len() - 4 { let val = rabin_hash(data[i..i + RABIN_WINDOW].try_into().unwrap()); self.entries .entry(val.into()) .or_default() .push(IndexEntry::<'a> { offset: self.last_offset + pos, data: &data[i..], }) } pos } } } self.last_offset += pos + unused_bytes; Ok(()) } // Compute index data from given buffer // // # Arguments // // * `max_bytes_to_index`: Limit the number of regions to sample to this // amount of text. We will store at most max_bytes_to_index / RABIN_WINDOW // pointers into the source text. Useful if src can be unbounded in size, // and you are willing to trade match accuracy for peak memory. pub fn add_fulltext( &mut self, src: &'a [u8], unused_bytes: usize, max_bytes_to_index: Option, ) { let stride = if let Some(max_bytes_to_index) = max_bytes_to_index { std::cmp::min(max_bytes_to_index, src.len()) / RABIN_WINDOW } else { RABIN_WINDOW }; let mut prev_val = None; for i in (0..(src.len().max(RABIN_WINDOW) - RABIN_WINDOW)).step_by(stride) { let val = rabin_hash(src[i..i + RABIN_WINDOW].try_into().unwrap()); if Some(val) == prev_val { // keep the lowest of consecutive identical hashes } else { prev_val = Some(val); self.entries .entry(val.into()) .or_default() .push(IndexEntry::<'a> { offset: self.last_offset + i, data: &src[i..], }) } } self.last_offset += src.len() + unused_bytes; } } pub fn iter_delta_instructions<'a>( index: &'a DeltaIndex<'a>, mut target: &'a [u8], ) -> impl Iterator> + 'a { assert!(target.len() >= RABIN_WINDOW); // Start the matching by filling out with a simple 'insert' instruction, of // the first RABIN_WINDOW bytes of the input. let mut block = &target[..RABIN_WINDOW]; let mut window = RabinWindow::new(block.try_into().unwrap()); let mut msize = 0; let mut msource: Option> = None; std::iter::from_fn(move || -> Option> { while target.len() > block.len() { if msize < 4096 { // we don't have a 'worthy enough' match yet, so let's look for // one. // Shift the window by one byte. 
                (msource, msize) = index
                    .find_match(window.hash(), target, msize, Some(4096))
                    .map_or((msource, msize), |(source, msize)| (Some(source), msize));
            }
            if msize < 4 {
                // The best match right now is less than 4 bytes long. So just add
                // the current byte to the insert instruction. Increment the insert
                // counter, and copy the byte of data into the output buffer.
                block = &target[..block.len() + 1];
                window.push(block[block.len() - 1]);
                msize = 0;
                if block.len() == MAX_INSERT_SIZE {
                    // We have a max length insert instruction, finalize it in the
                    // output.
                    target = &target[block.len()..];
                    let old_block = block;
                    block = &[];
                    return Some(Instruction::Insert(old_block));
                }
            } else {
                let region = msource.unwrap();
                assert!(msize <= region.data.len());
                let copy_len = msize.min(MAX_COPY_SIZE);
                msize -= copy_len;
                msource = Some(region.add(copy_len));
                target = &target[copy_len..];
                block = &[];
                if msize < 4096 {
                    // Keep the window in sync with the target buffer.
                    for c in &region.data[(copy_len - RABIN_WINDOW).min(0)..] {
                        window.push(*c);
                    }
                }
                return Some(Instruction::Copy {
                    offset: region.offset,
                    length: copy_len,
                });
            }
        }
        if !block.is_empty() {
            let old_block = block;
            block = &[];
            target = &[];
            return Some(Instruction::Insert(old_block));
        }
        None
    })
}

pub fn create_delta<'a, W: Write>(
    mut writer: W,
    index: &DeltaIndex<'a>,
    target: &'a [u8],
    max_delta_size: Option<usize>,
) -> Result<(), DeltaError> {
    let mut size = 0;
    // store target buffer size
    size += write_base128_int(&mut writer, target.len() as u128)?;
    if target.len() < RABIN_WINDOW {
        // If the target is smaller than the Rabin window, we can't do any
        // matching, so just write out the whole target as an insert instruction.
size += write_instruction(&mut writer, &Instruction::Insert(target))?; if let Some(max_delta_size) = max_delta_size { if size > max_delta_size { return Err(DeltaError::DeltaTooLarge); } } } else { for instruction in iter_delta_instructions(index, target) { size += write_instruction(&mut writer, &instruction)?; if let Some(max_delta_size) = max_delta_size { if size > max_delta_size { return Err(DeltaError::DeltaTooLarge); } } } } Ok(()) } /// Create a delta, this is a wrapper around DeltaIndex.make_delta. pub fn make_delta(source_bytes: &[u8], target_bytes: &[u8]) -> Vec { let mut out = Vec::new(); let mut di = DeltaIndex::new(); di.add_fulltext(source_bytes, 0, None); create_delta(&mut out, &di, target_bytes, None).unwrap(); out } #[cfg(test)] mod tests { const TEXT1: &[u8] = b"This is a bit of source text which is meant to be matched against other text "; const TEXT2: &[u8] = b"This is a bit of source text which is meant to differ from against other text "; const TEXT3: &[u8] = b"This is a bit of source text which is meant to be matched against other text except it also has a lot more data at the end of the file "; const FIRST_TEXT: &[u8] = b"a bit of text, that does not have much in common with the next text "; const SECOND_TEXT: &[u8] = b"some more bit of text, that does not have much in common with the previous text and has some extra text "; const THIRD_TEXT: &[u8] = b"a bit of text, that has some in common with the previous text and has some extra text and not have much in common with the next text "; const FOURTH_TEXT: &[u8] = b"123456789012345 same rabin hash 123456789012345 same rabin hash 123456789012345 same rabin hash 123456789012345 same rabin hash "; fn assert_delta(source: &[u8], target: &[u8], delta: &[u8]) { let mut di = super::DeltaIndex::new(); di.add_fulltext(source, 0, None); let mut out = Vec::new(); super::create_delta(&mut out, &di, target, None).unwrap(); assert_eq!( delta, &out[..], "delta: {:?}", super::iter_delta_instructions(&di, 
target).collect::>() ); } #[test] fn test_make_noop_delta() { assert_delta(TEXT1, TEXT1, b"M\x90M"); assert_delta(TEXT2, TEXT2, b"N\x90N"); assert_delta(TEXT3, TEXT3, b"\x87\x01\x90\x87"); } #[test] fn test_make_delta() { assert_delta(TEXT1, TEXT2, b"N\x90/\x1fdiffer from\nagainst other text\n"); assert_delta(TEXT2, TEXT1, b"M\x90/\x1ebe matched\nagainst other text\n"); assert_delta(TEXT3, TEXT1, b"M\x90M"); assert_delta(TEXT3, TEXT2, b"N\x90/\x1fdiffer from\nagainst other text\n"); } #[test] fn test_make_delta_with_large_copies() { // We want to have a copy that is larger than 64kB, which forces us to // issue multiple copy instructions. let big_text = TEXT3.repeat(1220); assert_delta( big_text.as_slice(), big_text.as_slice(), vec![ &b"\xdc\x86\x0a"[..], // Encoding the length of the uncompressed text &b"\x80"[..], // Copy 64kB, starting at byte 0 &b"\x84\x01"[..], // and another 64kB starting at 64kB &b"\xb4\x02\x5c\x83"[..], // And the bit of tail. ] .concat() .as_slice(), ) } } bzrformats_3.4.0.orig/crates/bazaar/src/smart/mod.rs0000644000000000000000000000002215162074037017440 0ustar00pub mod protocol; bzrformats_3.4.0.orig/crates/bazaar/src/smart/protocol.rs0000644000000000000000000000102515162074037020526 0ustar00// Protocol version strings. These are sent as prefixes of bzr requests and // responses to identify the protocol version being used. (There are no version // one strings because that version doesn't send any). 
pub const REQUEST_VERSION_TWO: &[u8] = b"bzr request 2\n"; pub const RESPONSE_VERSION_TWO: &[u8] = b"bzr response 2\n"; pub const MESSAGE_VERSION_THREE: &[u8] = b"bzr message 3 (bzr 1.6)\n"; pub const REQUEST_VERSION_THREE: &[u8] = MESSAGE_VERSION_THREE; pub const RESPONSE_VERSION_THREE: &[u8] = MESSAGE_VERSION_THREE; bzrformats_3.4.0.orig/crates/osutils-py/Cargo.toml0000644000000000000000000000040515162074037017223 0ustar00[package] name = "bzrformats-osutils" version = "0.1.0" edition = "2021" [lib] name = "_osutils_rs" crate-type = ["cdylib"] [dependencies] pyo3 = { version = ">=0.26,<0.28", features = ["extension-module"] } osutils = { path = "../osutils-rs" } walkdir = "2" bzrformats_3.4.0.orig/crates/osutils-py/src/0000755000000000000000000000000014414043471016060 5ustar00bzrformats_3.4.0.orig/crates/osutils-py/src/lib.rs0000644000000000000000000000674315162213673017212 0ustar00use pyo3::prelude::*; use pyo3::types::{PyBytes, PyList, PyModule}; use std::path::{Path, PathBuf}; #[pyfunction] fn split_lines<'a>(py: Python<'a>, text: &'a [u8]) -> PyResult> { let ret = PyList::empty(py); for line in osutils::split_lines(text) { let line_bytes = PyBytes::new(py, &line); ret.append(line_bytes)?; } Ok(ret) } #[pyfunction] fn rand_chars(num: usize) -> PyResult { Ok(osutils::rand_chars(num)) } #[pyfunction] fn is_inside(dir: &str, fname: &str) -> PyResult { let dir_path = Path::new(dir); let fname_path = Path::new(fname); Ok(osutils::path::is_inside(dir_path, fname_path)) } #[pyfunction] fn is_inside_any(dir_list: Vec, fname: &str) -> PyResult { let dir_paths: Vec<&Path> = dir_list.iter().map(|d| Path::new(d.as_str())).collect(); let fname_path = Path::new(fname); Ok(osutils::path::is_inside_any(&dir_paths, fname_path)) } #[pyfunction] fn parent_directories(path: &str) -> PyResult> { let path_obj = Path::new(path); let parents: Vec = osutils::path::parent_directories(path_obj) .map(|p| p.to_string_lossy().to_string()) .collect(); Ok(parents) } // Walkdirs 
implementation - simplified version for basic functionality #[pyfunction] fn walkdirs_utf8(top: &str) -> PyResult)>> { use std::fs; let mut results = Vec::new(); let walk = walkdir::WalkDir::new(top).follow_links(false); for entry in walk { let entry = entry.map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))?; let path = entry.path(); if path.is_dir() { let mut dir_entries = Vec::new(); // Read directory contents if let Ok(read_dir) = fs::read_dir(path) { for dir_entry in read_dir.flatten() { let name = dir_entry.file_name().to_string_lossy().to_string(); let metadata = dir_entry.metadata(); if let Ok(metadata) = metadata { let kind = if metadata.is_dir() { "directory" } else if metadata.is_symlink() { "symlink" } else { "file" }; let size = metadata.len(); let utf8path = dir_entry.path().to_string_lossy().to_string(); dir_entries.push((name, kind.to_string(), size, utf8path)); } } } results.push((path.to_string_lossy().to_string(), dir_entries)); } } Ok(results) } #[pyfunction] fn normalizes_filenames() -> bool { osutils::path::normalizes_filenames() } #[pyfunction] fn supports_symlinks(path: PathBuf) -> Option { osutils::mounts::supports_symlinks(path) } #[pymodule] fn _osutils_rs(_py: Python, m: &Bound) -> PyResult<()> { m.add_function(wrap_pyfunction!(split_lines, m)?)?; m.add_function(wrap_pyfunction!(rand_chars, m)?)?; m.add_function(wrap_pyfunction!(is_inside, m)?)?; m.add_function(wrap_pyfunction!(is_inside_any, m)?)?; m.add_function(wrap_pyfunction!(parent_directories, m)?)?; m.add_function(wrap_pyfunction!(walkdirs_utf8, m)?)?; m.add_function(wrap_pyfunction!(normalizes_filenames, m)?)?; m.add_function(wrap_pyfunction!(supports_symlinks, m)?)?; Ok(()) } bzrformats_3.4.0.orig/crates/osutils-rs/Cargo.toml0000644000000000000000000000073715162206230017217 0ustar00[package] name = "osutils" version = "0.1.0" edition = "2021" description = "Low level OS wrappers for bzrformats" license = "GPL-2.0+" [lib] [dependencies] memchr = "2.7.4" rand = 
"0.9" sha1 = "0.10" unicode-normalization = "0.1.19" pyo3 = { version = ">=0.26,<0.29", optional = true } chrono = "0.4" lazy_static = "1" log = "0.4" [target.'cfg(windows)'.dependencies] winapi = { version = "0.3", features = ["fileapi", "minwindef", "winnt"] } [features] pyo3 = ["dep:pyo3"] bzrformats_3.4.0.orig/crates/osutils-rs/src/0000755000000000000000000000000015162074037016057 5ustar00bzrformats_3.4.0.orig/crates/osutils-rs/src/chunkreader.rs0000644000000000000000000000455215162074037020726 0ustar00use std::borrow::Borrow; use std::io::Read; pub struct ChunksReader> { chunks: Box>, current_chunk: Option, position: usize, } impl> ChunksReader { pub fn new(chunks: Box>) -> Self { ChunksReader { chunks, position: 0, current_chunk: None, } } } impl> Read for ChunksReader { fn read(&mut self, buf: &mut [u8]) -> std::io::Result { let mut bytes_read = 0; while bytes_read < buf.len() { if let Some(chunk) = self.current_chunk.as_ref() { let bytes_to_copy = (buf.len() - bytes_read).min(chunk.borrow().len() - self.position); buf[bytes_read..bytes_read + bytes_to_copy] .copy_from_slice(&chunk.borrow()[self.position..self.position + bytes_to_copy]); self.position += bytes_to_copy; bytes_read += bytes_to_copy; if self.position == chunk.borrow().len() { self.current_chunk = None; } } else if let Some(chunk) = self.chunks.next() { self.current_chunk = Some(chunk); self.position = 0; } else { break; } } Ok(bytes_read) } } #[test] fn test_chunks_reader_vec() { let chunks = vec![vec![1, 2, 3], vec![4, 5, 6], vec![7, 8, 9]]; let mut reader = ChunksReader::new(Box::new(chunks.into_iter())); let mut buf = [0; 4]; assert_eq!(reader.read(&mut buf).unwrap(), 4); assert_eq!(buf, [1, 2, 3, 4]); assert_eq!(reader.read(&mut buf).unwrap(), 4); assert_eq!(buf, [5, 6, 7, 8]); assert_eq!(reader.read(&mut buf).unwrap(), 1); assert_eq!(buf[0], 9); assert_eq!(reader.read(&mut buf).unwrap(), 0); } #[test] fn test_chunks_reader_slice() { let chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]; let mut 
reader = ChunksReader::new(Box::new(chunks.into_iter())); let mut buf = [0; 4]; assert_eq!(reader.read(&mut buf).unwrap(), 4); assert_eq!(buf, [1, 2, 3, 4]); assert_eq!(reader.read(&mut buf).unwrap(), 4); assert_eq!(buf, [5, 6, 7, 8]); assert_eq!(reader.read(&mut buf).unwrap(), 1); assert_eq!(buf[0], 9); assert_eq!(reader.read(&mut buf).unwrap(), 0); } bzrformats_3.4.0.orig/crates/osutils-rs/src/file.rs0000644000000000000000000002117215162074037017347 0ustar00use core::ops::BitAnd; use log::debug; #[cfg(unix)] use nix::sys::stat::SFlag; use std::fs::{set_permissions, symlink_metadata, Permissions}; use std::io::Read; use std::io::Result; use std::path::Path; use walkdir::WalkDir; pub fn make_writable>(path: P) -> Result<()> { let path = path.as_ref(); let metadata = std::fs::symlink_metadata(path)?; let mut permissions = metadata.permissions(); if !metadata.file_type().is_symlink() { permissions.set_readonly(false); set_permissions(path, permissions)?; } Ok(()) } pub fn make_readonly>(path: P) -> Result<()> { let path = path.as_ref(); let metadata = std::fs::symlink_metadata(path)?; let mut permissions = metadata.permissions(); if !metadata.file_type().is_symlink() { permissions.set_readonly(true); set_permissions(path, permissions)?; } Ok(()) } pub fn chmod_if_possible>(path: P, permissions: Permissions) -> Result<()> { // Set file mode if that can be safely done. // Sometimes even on unix the filesystem won't allow it - see // https://bugs.launchpad.net/bzr/+bug/606537 if let Err(e) = set_permissions(path.as_ref(), permissions) { // Permission/access denied seems to commonly happen on smbfs; there's // probably no point warning about it. 
        //
        match e.kind() {
            std::io::ErrorKind::PermissionDenied => {
                debug!("ignore error on chmod of {:?}: {:?}", path.as_ref(), e);
                Ok(())
            }
            _ => Err(e),
        }
    } else {
        Ok(())
    }
}

#[cfg(unix)]
pub fn copy_ownership_from_path<P: AsRef<Path>>(dst: P, src: Option<&Path>) -> Result<()> {
    use nix::unistd::{chown, Gid, Uid};
    use std::os::unix::fs::MetadataExt;
    let mut src = match src {
        Some(p) => p,
        None => dst.as_ref().parent().unwrap_or_else(|| Path::new(".")),
    };

    if src == Path::new("") {
        src = Path::new(".");
    }

    let s = std::fs::metadata(src)?;
    let uid = s.uid();
    let gid = s.gid();

    if let Err(err) = chown(
        dst.as_ref(),
        Some(Uid::from_raw(uid)),
        Some(Gid::from_raw(gid)),
    ) {
        debug!(
            "Unable to copy ownership from \"{}\" to \"{}\". \
             You may want to set it manually. {}",
            src.display(),
            dst.as_ref().display(),
            err
        );
    }
    Ok(())
}

pub fn is_dir(f: &std::path::Path) -> bool {
    match std::fs::symlink_metadata(f) {
        Ok(metadata) => metadata.is_dir(),
        Err(_) => false,
    }
}

pub fn is_file(f: &std::path::Path) -> bool {
    match std::fs::symlink_metadata(f) {
        Ok(metadata) => metadata.is_file(),
        Err(_) => false,
    }
}

pub fn is_link(f: &std::path::Path) -> bool {
    match std::fs::symlink_metadata(f) {
        Ok(metadata) => metadata.file_type().is_symlink(),
        Err(_) => false,
    }
}

#[cfg(unix)]
pub fn link_or_copy<P: AsRef<Path>, Q: AsRef<Path>>(src: P, dest: Q) -> std::io::Result<()> {
    let src = src.as_ref();
    let dest = dest.as_ref();

    match std::fs::hard_link(src, dest) {
        Ok(_) => Ok(()),
        Err(e) => {
            // TODO(jelmer): This should really be checking for
            // e.kind() != std::io::ErrorKind::CrossesDeviceBoundary{
            // See https://github.com/rust-lang/rust/issues/86442
            if e.kind() != std::io::ErrorKind::Other {
                return Err(e);
            }
            std::fs::copy(src, dest)?;
            Ok(())
        }
    }
}

#[cfg(any(target_os = "windows"))]
pub fn link_or_copy<P: AsRef<Path>, Q: AsRef<Path>>(src: P, dest: Q) -> std::io::Result<()> {
    // std::fs::copy returns the number of bytes copied; discard it and
    // return Ok(()) so the function matches its declared return type.
    std::fs::copy(src.as_ref(), dest.as_ref())?;
    Ok(())
}

pub fn copy_tree<P: AsRef<Path>, Q: AsRef<Path>>(from_path: P, to_path: Q) -> std::io::Result<()> {
    for entry in WalkDir::new(from_path.as_ref()) {
        let
entry = entry?; let path = entry.path(); let dst_path = to_path .as_ref() .join(path.strip_prefix(from_path.as_ref()).unwrap()); if entry.file_type().is_dir() { match std::fs::create_dir(&dst_path) { Ok(_) => {} Err(e) => { if e.kind() != std::io::ErrorKind::AlreadyExists || dst_path != to_path.as_ref() { return Err(e); } } } } else if entry.file_type().is_file() { std::fs::copy(path, dst_path)?; } else if entry.file_type().is_symlink() { let target = std::fs::read_link(path)?; let target = target .strip_prefix(from_path.as_ref()) .unwrap_or(target.as_path()); #[cfg(unix)] std::os::unix::fs::symlink(target, dst_path)?; #[cfg(windows)] std::os::windows::fs::symlink_file(target, dst_path)?; } else { return Err(std::io::Error::new( std::io::ErrorKind::Other, format!("Unsupported file type: {:?}", entry.file_type()), )); } } Ok(()) } const FORMATS: [(SFlag, &str); 7] = [ (SFlag::S_IFDIR, "directory"), (SFlag::S_IFCHR, "chardev"), (SFlag::S_IFBLK, "block"), (SFlag::S_IFREG, "file"), (SFlag::S_IFIFO, "fifo"), (SFlag::S_IFLNK, "symlink"), (SFlag::S_IFSOCK, "socket"), ]; pub fn kind_from_mode(mode: SFlag) -> &'static str { for (format_mode, format_kind) in FORMATS.iter() { if mode.bitand(SFlag::S_IFMT) == *format_mode { return format_kind; } } "unknown" } pub fn delete_any>(path: P) -> std::io::Result<()> { fn delete_file_or_dir>(path: P) -> std::io::Result<()> { let path = path.as_ref(); // Look Before You Leap (LBYL) is appropriate here instead of Easier to Ask for // Forgiveness than Permission (EAFP) because: // - root can damage a solaris file system by using unlink, // - unlink raises different exceptions on different OSes (linux: EISDIR, win32: // EACCES, OSX: EPERM) when invoked on a directory. 
        if path.is_dir() {
            std::fs::remove_dir(path)
        } else {
            std::fs::remove_file(path)
        }
    }

    // handle errors due to read-only files/directories
    match delete_file_or_dir(path.as_ref()) {
        Ok(()) => Ok(()),
        Err(ref e) if e.kind() == std::io::ErrorKind::PermissionDenied => {
            if let Err(e) = make_writable(path.as_ref()) {
                debug!("Unable to make {:?} writable: {}", path.as_ref(), e);
            }
            delete_file_or_dir(path.as_ref())
        }
        Err(e) => Err(e),
    }
}

pub fn file_iterator<F: Read>(
    input_file: F,
    readsize: Option<usize>,
) -> impl Iterator<Item = Vec<u8>> {
    let readsize = readsize.unwrap_or(32768);
    let mut buffer = vec![0; readsize];
    let mut reader = std::io::BufReader::new(input_file);

    std::iter::from_fn(move || match reader.read(&mut buffer) {
        Ok(0) => None,
        Ok(n) => Some(buffer[..n].to_vec()),
        Err(_) => None,
    })
}

pub fn ensure_empty_directory_exists(path: &Path) -> std::io::Result<()> {
    match std::fs::create_dir(path) {
        Ok(_) => Ok(()),
        Err(e) => {
            if e.kind() != std::io::ErrorKind::AlreadyExists {
                Err(e)
            } else {
                let dir_entries = std::fs::read_dir(path)?;
                if dir_entries.count() > 0 {
                    Err(std::io::Error::new(
                        // TODO(jelmer): Switch to DirectoryNotEmpty once available:
                        // std::io::ErrorKind::DirectoryNotEmpty,
                        std::io::ErrorKind::Other,
                        format!("Directory {:?} is not empty", path),
                    ))
                } else {
                    Ok(())
                }
            }
        }
    }
}

pub fn lexists(path: &Path) -> std::io::Result<bool> {
    symlink_metadata(path).map(|_| true).or(Ok(false))
}

pub fn compare_files<T: Read, U: Read>(mut a: T, mut b: U) -> std::io::Result<bool> {
    const BUFSIZE: usize = 4096;
    let mut buf_a = [0; BUFSIZE];
    let mut buf_b = [0; BUFSIZE];
    loop {
        let n_a = a.read(&mut buf_a)?;
        let n_b = b.read(&mut buf_b)?;
        if buf_a[..n_a] != buf_b[..n_b] {
            return Ok(false);
        }
        if n_a == 0 {
            return Ok(n_b == 0);
        }
    }
}

pub fn isdir(f: &Path) -> bool {
    match std::fs::symlink_metadata(f) {
        Ok(metadata) => metadata.is_dir(),
        Err(_) => false,
    }
}
bzrformats_3.4.0.orig/crates/osutils-rs/src/iterablefile.rs0000644000000000000000000001411715162074037021060 0ustar00
use std::io::{self, BufRead, Read, Seek, SeekFrom};

pub
struct IterableFile>> + Send + Sync> { iter: I, buffer: Vec, } impl>> + Send + Sync> IterableFile { pub fn new(iter: I) -> Self { IterableFile { iter, buffer: Vec::new(), } } } impl>> + Send + Sync> Read for IterableFile { fn read(&mut self, buf: &mut [u8]) -> io::Result { let n = self.fill_buf()?.read(buf)?; self.consume(n); Ok(n) } } impl>> + Send + Sync> BufRead for IterableFile { fn fill_buf(&mut self) -> io::Result<&[u8]> { while self.buffer.is_empty() { if let Some(bytes) = self.iter.next() { self.buffer = bytes?; } else { break; } } Ok(&self.buffer) } fn consume(&mut self, amt: usize) { self.buffer.drain(..amt); } } impl>> + Seek + Send + Sync> Seek for IterableFile { fn seek(&mut self, pos: SeekFrom) -> io::Result { match pos { SeekFrom::Start(n) => { self.iter.seek(SeekFrom::Start(n))?; self.buffer.clear(); } SeekFrom::Current(n) => { if n >= 0 { let mut skip = n as usize; while skip > 0 { let buf = self.fill_buf()?; if buf.is_empty() { break; } let n = std::cmp::min(skip, buf.len()); self.consume(n); skip -= n; } } else { self.seek(SeekFrom::End(n))?; } } SeekFrom::End(n) => { let mut pos = self.iter.seek(SeekFrom::End(0))? 
as i64; pos += n; if pos < 0 { return Err(io::Error::new( io::ErrorKind::InvalidInput, "invalid seek to a negative or overflowing position", )); } self.iter.seek(SeekFrom::Start(pos as u64))?; self.buffer.clear(); } } self.iter.stream_position() } } #[cfg(test)] mod tests { use super::*; #[test] fn test_read_all() { let content: Vec> = vec![ b"This ".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let mut buf = Vec::new(); let read = file.read_to_end(&mut buf).unwrap(); assert_eq!(read, 15); assert_eq!(&buf, b"This is a test."); } #[test] fn test_read_n() { let content: Vec> = vec![ b"This ".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let mut buf = [0u8; 8]; file.read_exact(&mut buf).unwrap(); assert_eq!(&buf, b"This is "); } #[test] fn test_read_to() { let content: Vec> = vec![ b"This\n".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.\n".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let mut buf = Vec::new(); file.read_until(b'\n', &mut buf).unwrap(); assert_eq!(&buf, b"This\n"); buf.clear(); let read = file.read_until(b'\n', &mut buf).unwrap(); assert_eq!(read, 11); assert_eq!(&buf, b"is a test.\n"); } #[test] fn test_readline() { let content: Vec> = vec![ b"".to_vec(), b"This\n".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.\n".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let mut buf = String::new(); let read = file.read_line(&mut buf).unwrap(); assert_eq!(read, 5); assert_eq!(&buf, "This\n"); } #[test] fn test_readlines() { let content: Vec> = vec![ b"This\n".to_vec(), b"is ".to_vec(), b"".to_vec(), b"a ".to_vec(), b"test.\n".to_vec(), ]; let file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let lines: Vec = file.lines().map(|line| line.unwrap()).collect(); assert_eq!(lines, 
vec!["This", "is a test."]); } #[test] fn test_fillbuf() { let content: Vec> = vec![ b"This ".to_vec(), b"".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); assert_eq!(file.fill_buf().unwrap(), b"This "); file.consume(5); assert_eq!(file.fill_buf().unwrap(), b"is "); file.consume(3); assert_eq!(file.fill_buf().unwrap(), b"a "); file.consume(2); assert_eq!(file.fill_buf().unwrap(), b"test."); file.consume(5); assert!(file.fill_buf().unwrap().is_empty()); } #[test] fn test_drain() { let content: Vec> = vec![ b"This ".to_vec(), b"is ".to_vec(), b"a ".to_vec(), b"test.".to_vec(), ]; let mut file = IterableFile::new(content.iter().map(|x| Ok(x.to_vec()))); let buf = file.fill_buf().unwrap(); assert_eq!(buf, b"This "); file.consume(5); let buf = file.fill_buf().unwrap(); assert_eq!(buf, b"is "); file.consume(1); let buf = file.fill_buf().unwrap(); assert_eq!(buf, b"s "); } } bzrformats_3.4.0.orig/crates/osutils-rs/src/lib.rs0000644000000000000000000001555715162074037017210 0ustar00use memchr::memchr; use rand::Rng; use std::borrow::Cow; pub fn is_well_formed_line(line: &[u8]) -> bool { if line.is_empty() { return false; } memchr(b'\n', line) == Some(line.len() - 1) } pub trait AsCow<'a, T: ToOwned + ?Sized> { fn as_cow(self) -> Cow<'a, T>; } impl<'a> AsCow<'a, [u8]> for &'a [u8] { fn as_cow(self) -> Cow<'a, [u8]> { Cow::Borrowed(self) } } impl<'a> AsCow<'a, [u8]> for Cow<'a, [u8]> { fn as_cow(self) -> Cow<'a, [u8]> { self } } impl<'a> AsCow<'a, [u8]> for Vec { fn as_cow(self) -> Cow<'a, [u8]> { Cow::Owned(self) } } impl<'a> AsCow<'a, [u8]> for &'a Vec { fn as_cow(self) -> Cow<'a, [u8]> { Cow::Borrowed(self.as_slice()) } } pub fn chunks_to_lines<'a, C, I, E>(chunks: I) -> impl Iterator, E>> where I: Iterator> + 'a, C: AsCow<'a, [u8]> + 'a, E: std::fmt::Debug, { pub struct ChunksToLines<'a, C, E> where C: AsCow<'a, [u8]>, E: std::fmt::Debug, { chunks: Box> + 'a>, tail: Vec, } impl<'a, 
C, E: std::fmt::Debug> Iterator for ChunksToLines<'a, C, E> where C: AsCow<'a, [u8]>, { type Item = Result, E>; fn next(&mut self) -> Option { loop { // See if we can find a line in tail if let Some(newline) = memchr(b'\n', &self.tail) { // The chunk contains multiple lines, so split it into lines let line = Cow::Owned(self.tail[..=newline].to_vec()); self.tail.drain(..=newline); return Some(Ok(line)); } else { // We couldn't find a newline if let Some(next_chunk) = self.chunks.next() { match next_chunk { Err(e) => { return Some(Err(e)); } Ok(next_chunk) => { let next_chunk = next_chunk.as_cow(); // If the chunk is well-formed, return it if self.tail.is_empty() && is_well_formed_line(next_chunk.as_ref()) { return Some(Ok(next_chunk)); } else { self.tail.extend_from_slice(next_chunk.as_ref()); } } } } else { // We've reached the end of the chunks, so return the last chunk if self.tail.is_empty() { return None; } let line = Cow::Owned(self.tail.to_vec()); self.tail.clear(); return Some(Ok(line)); } } } } } ChunksToLines { chunks: Box::new(chunks), tail: Vec::new(), } } #[test] fn test_chunks_to_lines() { assert_eq!( chunks_to_lines(vec![Ok::<_, std::io::Error>("foo\nbar".as_bytes().as_cow())].into_iter()) .map(|x| x.unwrap()) .collect::>(), vec!["foo\n".as_bytes().as_cow(), "bar".as_bytes().as_cow()] ); } pub fn split_lines(text: &[u8]) -> impl Iterator> { pub struct SplitLines<'a> { text: &'a [u8], } impl<'a> Iterator for SplitLines<'a> { type Item = Cow<'a, [u8]>; fn next(&mut self) -> Option { if self.text.is_empty() { return None; } if let Some(newline) = memchr(b'\n', self.text) { let line = Cow::Borrowed(&self.text[..=newline]); self.text = &self.text[newline + 1..]; Some(line) } else { // No newline found, so return the rest of the text let line = Cow::Borrowed(self.text); self.text = &self.text[self.text.len()..]; Some(line) } } } SplitLines { text } } #[test] fn test_split_lines() { assert_eq!( split_lines("foo\nbar".as_bytes()) .map(|x| x.to_vec()) 
.collect::>(), vec!["foo\n".as_bytes().to_vec(), "bar".as_bytes().to_vec()] ); } const ALNUM: &str = "0123456789abcdefghijklmnopqrstuvwxyz"; pub fn rand_chars(num: usize) -> String { let mut rng = rand::rng(); let mut s = String::new(); for _ in 0..num { let raw_byte = rng.random_range(0..256); s.push(ALNUM.chars().nth(raw_byte % 36).unwrap()); } s } pub fn contains_whitespace(s: &str) -> bool { let ws = " \t\n\r\u{000B}\u{000C}"; for ch in ws.chars() { if s.contains(ch) { return true; } } false } #[derive(Debug, PartialEq)] pub enum Kind { File, Directory, Symlink, TreeReference, } impl Kind { pub fn marker(&self) -> &'static str { match self { Kind::File => "", Kind::Directory => "/", Kind::Symlink => "@", Kind::TreeReference => "+", } } pub fn to_string(&self) -> &'static str { match self { Kind::File => "file", Kind::Directory => "directory", Kind::Symlink => "symlink", Kind::TreeReference => "tree-reference", } } } #[cfg(feature = "pyo3")] impl<'py> pyo3::IntoPyObject<'py> for Kind { type Target = pyo3::types::PyString; type Output = pyo3::Bound<'py, Self::Target>; type Error = std::convert::Infallible; fn into_pyobject(self, py: pyo3::Python<'py>) -> Result { match self { Kind::File => "file", Kind::Directory => "directory", Kind::Symlink => "symlink", Kind::TreeReference => "tree-reference", } .into_pyobject(py) } } #[cfg(feature = "pyo3")] impl<'a, 'py> pyo3::FromPyObject<'a, 'py> for Kind { type Error = pyo3::PyErr; fn extract(ob: pyo3::Borrowed<'a, 'py, pyo3::PyAny>) -> pyo3::PyResult { let s: String = ob.extract()?; match s.as_str() { "file" => Ok(Kind::File), "directory" => Ok(Kind::Directory), "symlink" => Ok(Kind::Symlink), "tree-reference" => Ok(Kind::TreeReference), _ => Err(pyo3::exceptions::PyValueError::new_err(format!( "Invalid kind: {}", s ))), } } } pub mod chunkreader; #[cfg(unix)] #[path = "mounts-unix.rs"] pub mod mounts; #[cfg(windows)] #[path = "mounts-win32.rs"] pub mod mounts; pub mod path; pub mod sha; pub mod time; 
bzrformats_3.4.0.orig/crates/osutils-rs/src/mounts-unix.rs0000644000000000000000000002120015162074037020726 0ustar00
use lazy_static::lazy_static;
use log::{debug, warn};
use std::collections::HashSet;
use std::ffi::OsString;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::os::unix::ffi::OsStringExt;
use std::path::{Path, PathBuf};

pub struct MountEntry {
    pub path: PathBuf,
    pub fs_type: String,
    pub options: String,
}

// Read a mtab-style file
pub fn read_mtab<P: AsRef<Path>>(path: P) -> impl Iterator<Item = MountEntry> {
    let file = File::open(path).unwrap();
    let reader = BufReader::new(file);
    reader
        .lines()
        .filter_map(|line| line.ok())
        .filter(|line| !line.starts_with('#'))
        .filter_map(|line| {
            let cols: Vec<Vec<u8>> = line
                .split_whitespace()
                .map(|s| s.as_bytes().to_vec())
                .collect();
            // Columns are: device, mount point, fs type, options. The options
            // column (index 3) is read below, so at least four are required.
            if cols.len() >= 4 {
                let path = PathBuf::from(OsString::from_vec(cols[1].clone()));
                let fs_type = String::from_utf8_lossy(&cols[2]).to_string();
                let options = String::from_utf8_lossy(&cols[3]).to_string();
                Some(MountEntry {
                    path,
                    fs_type,
                    options,
                })
            } else {
                None
            }
        })
}

// Sort mounts so that the longest (most specific) mount point comes first.
fn sort_mounts(mounts: &mut [MountEntry]) {
    mounts.sort_by(|a, b| b.path.as_os_str().len().cmp(&a.path.as_os_str().len()));
}

#[cfg(target_os = "linux")]
#[test]
fn test_sort_mounts() {
    let mut mounts = vec![
        MountEntry {
            path: PathBuf::from("/"),
            fs_type: "ext4".to_string(),
            options: "rw,relatime,errors=remount-ro".to_string(),
        },
        MountEntry {
            path: PathBuf::from("/var"),
            fs_type: "ext4".to_string(),
            options: "rw,relatime,errors=remount-ro".to_string(),
        },
        MountEntry {
            path: PathBuf::from("/var/blah"),
            fs_type: "ext4".to_string(),
            options: "rw,relatime,errors=remount-ro".to_string(),
        },
    ];
    sort_mounts(&mut mounts);
    assert_eq!(
        vec!["/var/blah", "/var", "/"],
        mounts
            .iter()
            .map(|m| m.path.to_str().unwrap())
            .collect::<Vec<&str>>()
    );
}

#[cfg(target_os = "linux")]
fn load_mounts() -> Vec<MountEntry> {
    let mut mounts: Vec<MountEntry> = read_mtab("/proc/mounts").collect();
    sort_mounts(&mut mounts);
    mounts
}

#[cfg(any(target_os = "macos", target_os = "openbsd"))]
fn
parse_mount_line(line: &str) -> Option { if line.is_empty() { return None; } if line.starts_with("#") { return None; } let parts: Vec<&str> = line.split_whitespace().collect(); if parts.len() < 3 { return None; } let path = PathBuf::from(parts[2]); let fs_type = parts[0].to_string(); let options = parts[3..].join(" "); Some(MountEntry { path, fs_type, options, }) } #[cfg(any(target_os = "macos", target_os = "openbsd"))] #[test] fn test_parse_mount_line() { let line = "devfs on /dev (devfs, local, nobrowse)"; let mount_entry = parse_mount_line(line).unwrap(); assert_eq!(mount_entry.path, PathBuf::from("/dev")); assert_eq!(mount_entry.fs_type, "devfs"); assert_eq!(mount_entry.options, "local, nobrowse"); } #[cfg(any(target_os = "macos", target_os = "openbsd"))] fn load_mounts() -> Vec { // BSD does not have a /proc/mounts equivalent, so we use the output of // `mount` command // // TODO: find a more robust and efficient way to get mount information let output = std::process::Command::new("mount") .output() .expect("Failed to execute mount command"); let stdout = String::from_utf8_lossy(&output.stdout); let mut mounts = Vec::new(); for line in stdout.lines() { if let Some(mount_entry) = parse_mount_line(line) { mounts.push(mount_entry); } } sort_mounts(&mut mounts); mounts } #[cfg(target_os = "linux")] #[test] fn test_load_mounts() { let mounts = load_mounts(); assert!(!mounts.is_empty()); assert!(mounts[mounts.len() - 1].path == PathBuf::from("/")); } pub fn find_mount_entry>(entries: &[MountEntry], path: P) -> Option<&MountEntry> { entries .iter() .find(|&entry| super::path::is_inside(entry.path.as_path(), path.as_ref())) } lazy_static! 
{ static ref MOUNTS: Vec = load_mounts(); } fn extract_option<'a>(options: &'a str, name: &str) -> Option<&'a str> { for option in options.split(',') { let parts: Vec<&str> = option.split('=').collect(); if parts.len() == 2 && parts[0] == name { return Some(parts[1]); } } warn!("Could not find upperdir in overlay options {:?}", options); None } fn get_fs_type_ext>(entries: &[MountEntry], path: P) -> Option<&str> { let mut seen = HashSet::new(); let mut path = path.as_ref().to_path_buf(); loop { let entry = find_mount_entry(entries, path)?; if entry.fs_type == "overlay" { path = extract_option(&entry.options, "upperdir").map(PathBuf::from)?; if !seen.insert(path.clone()) { warn!("Loop in overlayfs mounts {:?}", seen); return None; } } else { return Some(entry.fs_type.as_str()); } } } #[cfg(target_os = "linux")] #[test] fn test_get_fs_type() { let mounts = vec![MountEntry { path: PathBuf::from("/"), fs_type: "ext4".to_string(), options: "rw,relatime,errors=remount-ro".to_string(), }]; assert!(get_fs_type_ext(&mounts, "/") == Some("ext4")); assert!(get_fs_type_ext(&mounts, "/etc/passwd") == Some("ext4")); } #[cfg(target_os = "linux")] #[test] fn test_get_fs_type_overlay() { let mut mounts = vec![ MountEntry { path: PathBuf::from("/var/blah"), fs_type: "ext4".to_string(), options: "rw,relatime,errors=remount-ro".to_string(), }, MountEntry { path: PathBuf::from("/"), fs_type: "overlay".to_string(), options: "rw,relatime,errors=remount-ro,upperdir=/var/blah".to_string(), }, ]; sort_mounts(&mut mounts); assert_eq!(get_fs_type_ext(&mounts, "/var/blah"), Some("ext4")); assert_eq!(get_fs_type_ext(&mounts, "/"), Some("ext4")); assert_eq!(get_fs_type_ext(&mounts, "/etc/passwd"), Some("ext4")); let mounts = vec![MountEntry { path: PathBuf::from("/"), fs_type: "overlay".to_string(), options: "rw,relatime,errors=remount-ro".to_string(), }]; assert!(get_fs_type_ext(&mounts, "/").is_none()); let mounts = vec![MountEntry { path: PathBuf::from("/"), fs_type: "overlay".to_string(), 
options: "rw,relatime,errors=remount-ro,upperdir=/foo".to_string(), }]; assert!(get_fs_type_ext(&mounts, "/").is_none()); } pub fn get_fs_type>(path: P) -> Option { get_fs_type_ext(&MOUNTS, path.as_ref()).map(|s| s.to_string()) } pub fn supports_hardlinks>(path: P) -> Option { let fs_type = get_fs_type(path.as_ref())?; match fs_type.as_str() { "ext2" | "ext3" | "ext4" | "btrfs" | "xfs" | "jfs" | "reiserfs" | "zfs" => Some(true), "vfat" | "ntfs" => Some(false), _ => { debug!("Unknown fs type: {}", fs_type); Some(false) } } } pub fn supports_executable>(path: P) -> Option { let fs_type = get_fs_type(path.as_ref())?; match fs_type.as_str() { "vfat" | "ntfs" => Some(false), "ext2" | "ext3" | "ext4" | "btrfs" | "xfs" | "jfs" | "reiserfs" | "zfs" => Some(true), _ => { debug!("Unknown fs type: {}", fs_type); Some(true) } } } pub fn supports_symlinks>(path: P) -> Option { let fs_type = get_fs_type(path.as_ref())?; match fs_type.as_str() { "vfat" | "ntfs" => Some(false), // Maybe? "ext2" | "ext3" | "ext4" | "btrfs" | "xfs" | "jfs" | "reiserfs" | "zfs" => Some(true), _ => { debug!("Unknown fs type: {}", fs_type); Some(true) } } } /// Return True if 'readonly' has POSIX semantics, False otherwise. /// /// Notably, a win32 readonly file cannot be deleted, unlike POSIX where the /// directory controls creation/deletion, etc. /// /// And under win32, readonly means that the directory itself cannot be /// deleted. The contents of a readonly directory can be changed, unlike POSIX /// where files in readonly directories cannot be added, deleted or renamed. 
pub fn supports_posix_readonly() -> bool { true } bzrformats_3.4.0.orig/crates/osutils-rs/src/mounts-win32.rs0000644000000000000000000000332515162213673020716 0ustar00use std::ffi::OsStr; use std::os::windows::ffi::{OsStrExt, OsStringExt}; use std::path::Path; use std::ptr; use winapi::shared::minwindef::DWORD; use winapi::um::fileapi::GetVolumeInformationW; fn _get_fs_type(drive: &str) -> Option { const MAX_FS_TYPE_LENGTH: DWORD = 16; let mut fs_type = vec![0u16; (MAX_FS_TYPE_LENGTH + 1) as usize]; let res = unsafe { GetVolumeInformationW( OsStr::new(drive) .encode_wide() .chain(std::iter::once(0)) .collect::>() .as_ptr(), ptr::null_mut(), 0, ptr::null_mut(), ptr::null_mut(), ptr::null_mut(), fs_type.as_mut_ptr(), MAX_FS_TYPE_LENGTH, ) }; if res != 0 { let fs_type_os = std::ffi::OsString::from_wide(&fs_type[..]); let fs_type_str = fs_type_os.to_str().unwrap_or_default(); Some(fs_type_str.to_owned()) } else { None } } pub fn get_fs_type>(path: P) -> Option { let drive = path.as_ref().parent().and_then(|p| p.to_str()).unwrap_or_default(); let drive = if drive.contains(':') { drive } else { &format!("{}\\", drive) }; let fs_type = _get_fs_type(drive)?; Some(match fs_type.as_str() { "FAT32" => String::from("vfat"), "NTFS" => String::from("ntfs"), _ => fs_type, }) } pub fn supports_symlinks>(path: P) -> Option { let fs_type = get_fs_type(path)?; match fs_type.as_str() { "ntfs" => Some(true), "vfat" => Some(false), _ => Some(false), } } pub fn supports_posix_readonly() -> bool { false } bzrformats_3.4.0.orig/crates/osutils-rs/src/path.rs0000644000000000000000000000460715162074037017370 0ustar00use std::path::{Path, PathBuf}; use unicode_normalization::{is_nfc, UnicodeNormalization}; pub fn is_inside(dir: &Path, fname: &Path) -> bool { fname.starts_with(dir) } pub fn is_inside_any(dir_list: &[&Path], fname: &Path) -> bool { for dirname in dir_list { if is_inside(dirname, fname) { return true; } } false } pub fn parent_directories(path: &Path) -> impl Iterator { let mut 
path = path; std::iter::from_fn(move || { if let Some(parent) = path.parent() { path = parent; if path.parent().is_none() { None } else { Some(path) } } else { None } }) } #[derive(Debug)] pub struct InvalidPathSegmentError(pub String); pub fn splitpath(p: &str) -> std::result::Result, InvalidPathSegmentError> { #[cfg(windows)] let split = |c| c == '/' || c == '\\'; #[cfg(not(windows))] let split = |c| c == '/'; let mut rps = Vec::new(); for f in p.split(split) { if f == ".." { return Err(InvalidPathSegmentError(f.to_string())); } else if f == "." || f.is_empty() { continue; } else { rps.push(f); } } Ok(rps) } pub fn accessible_normalized_filename(path: &Path) -> Option<(PathBuf, bool)> { path.to_str().map(|path_str| { if is_nfc(path_str) { (path.to_path_buf(), true) } else { (PathBuf::from(path_str.nfc().collect::()), true) } }) } pub fn inaccessible_normalized_filename(path: &Path) -> Option<(PathBuf, bool)> { path.to_str().map(|path_str| { if is_nfc(path_str) { (path.to_path_buf(), true) } else { let normalized_path = path_str.nfc().collect::(); let accessible = normalized_path == path_str; (PathBuf::from(normalized_path), accessible) } }) } #[cfg(target_os = "macos")] pub fn normalized_filename(path: &Path) -> Option<(PathBuf, bool)> { accessible_normalized_filename(path) } #[cfg(not(target_os = "macos"))] pub fn normalized_filename(path: &Path) -> Option<(PathBuf, bool)> { inaccessible_normalized_filename(path) } pub fn normalizes_filenames() -> bool { #[cfg(target_os = "macos")] return true; #[cfg(not(target_os = "macos"))] return false; } bzrformats_3.4.0.orig/crates/osutils-rs/src/sha.rs0000644000000000000000000000273615162074037017210 0ustar00use sha1::{Digest, Sha1}; use std::fs::File; use std::io::Read; use std::path::Path; pub fn sha_file(f: &mut dyn Read) -> Result { let mut s = Sha1::new(); std::io::copy(f, &mut s)?; Ok(format!("{:x}", s.finalize())) } pub fn size_sha_file(f: &mut dyn Read) -> Result<(usize, String), std::io::Error> { let mut s = 
Sha1::new(); const BUFSIZE: usize = 128 << 10; let mut buffer = [0; BUFSIZE]; let mut size: usize = 0; loop { let bytes_read = f.read(&mut buffer)?; if bytes_read == 0 { break; } s.update(&buffer[..bytes_read]); size += bytes_read; } Ok((size, format!("{:x}", s.finalize()))) } pub fn size_sha_chunks(chunks: impl Iterator>) -> (usize, String) { let mut s = Sha1::new(); let mut size: usize = 0; for chunk in chunks { s.update(&chunk); size += chunk.len(); } (size, format!("{:x}", s.finalize())) } pub fn sha_file_by_name>(path: P) -> Result { let mut f = File::open(path)?; sha_file(&mut f) } pub fn sha_chunks(strings: I) -> String where I: IntoIterator, S: AsRef<[u8]>, { let mut s = Sha1::new(); for string in strings { s.update(string.as_ref()); } format!("{:x}", s.finalize()) } pub fn sha_string(string: &[u8]) -> String { let mut s = Sha1::new(); s.update(string); format!("{:x}", s.finalize()) } bzrformats_3.4.0.orig/crates/osutils-rs/src/terminal.rs0000644000000000000000000000373115162074037020244 0ustar00use std::io::Read; use std::io::{stdout, Write}; use termion::color::{Bg, Color, Fg, Reset}; use termion::is_tty; pub fn terminal_size() -> std::io::Result<(u16, u16)> { termion::terminal_size() } pub fn has_ansi_colors() -> bool { #[cfg(windows)] { return false; } if !is_tty(&stdout()) { return false; } #[cfg(not(windows))] { use termion::color::DetectColors; use termion::raw::IntoRawMode; match stdout().into_raw_mode() { Ok(mut term) => match term.available_colors() { Ok(count) => count >= 8, Err(_) => false, }, Err(_) => false, } } } pub fn colorstring( text: &[u8], fgcolor: Option, bgcolor: Option, ) -> Vec { let mut ret = Vec::new(); if let Some(color) = fgcolor { ret.write_all(Fg(color).to_string().as_bytes()).unwrap(); } if let Some(color) = bgcolor { ret.write_all(Bg(color).to_string().as_bytes()).unwrap(); } ret.extend_from_slice(text); ret.write_all(Fg(Reset).to_string().as_bytes()).unwrap(); ret.write_all(Bg(Reset).to_string().as_bytes()).unwrap(); ret } 
#[cfg(unix)] pub fn getchar() -> Result { use std::os::unix::io::AsRawFd; let stdin = std::io::stdin(); let fd = stdin.as_raw_fd(); // Save the current terminal settings let original_termios = termios::Termios::from_fd(fd)?; // Set the terminal to raw mode let mut raw_termios = original_termios; termios::cfmakeraw(&mut raw_termios); termios::tcsetattr(fd, termios::TCSADRAIN, &raw_termios)?; // Read a single character from stdin let mut buffer = [0u8; 1]; stdin.lock().read_exact(&mut buffer)?; // Restore the original terminal settings termios::tcsetattr(fd, termios::TCSADRAIN, &original_termios)?; // Convert the read byte to a char let ch = buffer[0] as char; Ok(ch) } bzrformats_3.4.0.orig/crates/osutils-rs/src/tests.rs0000644000000000000000000000772115162074037017576 0ustar00use crate::chunks_to_lines; use crate::path::{accessible_normalized_filename, inaccessible_normalized_filename}; use std::path::{Path, PathBuf}; fn assert_chunks_to_lines(input: Vec<&str>, expected: Vec<&str>) { let iter = input.iter().map(|l| Ok::<&[u8], String>(l.as_bytes())); let got = chunks_to_lines(iter); let got = got .map(|l| String::from_utf8_lossy(l.unwrap().as_ref()).to_string()) .collect::>(); assert_eq!(got, expected); } #[test] fn test_chunks_to_lines() { assert_chunks_to_lines(vec!["a"], vec!["a"]); assert_chunks_to_lines(vec!["a\n"], vec!["a\n"]); assert_chunks_to_lines(vec!["a\nb\n"], vec!["a\n", "b\n"]); assert_chunks_to_lines(vec!["a\n", "b\n"], vec!["a\n", "b\n"]); assert_chunks_to_lines(vec!["a", "\n", "b", "\n"], vec!["a\n", "b\n"]); assert_chunks_to_lines(vec!["a", "a", "\n", "b", "\n"], vec!["aa\n", "b\n"]); assert_chunks_to_lines(vec![""], vec![]); } #[test] fn test_is_inside() { fn is_inside(path: &str, dir: &str) -> bool { crate::path::is_inside(Path::new(path), Path::new(dir)) } assert!(is_inside("a", "a")); assert!(!is_inside("a", "b")); assert!(is_inside("a", "a/b")); assert!(!is_inside("b", "a/b")); assert!(is_inside("a/b", "a/b")); assert!(!is_inside("a/b", 
"a/c")); assert!(is_inside("a/b", "a/b/c")); assert!(!is_inside("a/b/c", "a/b")); assert!(is_inside("", "a")); assert!(!is_inside("a", "")); } #[test] fn test_is_inside_any() { fn is_inside_any(path: &str, dirs: &[&str]) -> bool { let dirs = dirs.iter().map(Path::new).collect::>(); crate::path::is_inside_any(dirs.as_slice(), Path::new(path)) } assert!(is_inside_any("a", &["a"])); assert!(!is_inside_any("a", &["b"])); assert!(is_inside_any("a/b", &["a"])); assert!(!is_inside_any("a/b", &["b"])); assert!(is_inside_any("a/b", &["a/b"])); assert!(!is_inside_any("a/b", &["a/c"])); assert!(!is_inside_any("a/b", &["a/b/c"])); assert!(is_inside_any("a/b/c", &["a/b"])); assert!(!is_inside_any("", &["a"])); assert!(is_inside_any("a", &[""])); assert!(is_inside_any("a", &["a", "b"])); assert!(is_inside_any("a", &["b", "a"])); assert!(!is_inside_any("a", &["b", "c"])); } #[test] fn test_is_inside_or_parent_of_any() { fn is_inside_or_parent_of_any(path: &str, dirs: &[&str]) -> bool { let dirs = dirs.iter().map(Path::new).collect::>(); crate::path::is_inside_or_parent_of_any(dirs.as_slice(), Path::new(path)) } assert!(is_inside_or_parent_of_any("a", &["a"])); assert!(!is_inside_or_parent_of_any("a", &["b"])); assert!(is_inside_or_parent_of_any("a/b", &["a"])); assert!(!is_inside_or_parent_of_any("a/b", &["b"])); assert!(is_inside_or_parent_of_any("a/b", &["a/b"])); assert!(!is_inside_or_parent_of_any("a/b", &["a/c"])); assert!(is_inside_or_parent_of_any("a/b", &["a/b/c"])); assert!(is_inside_or_parent_of_any("a/b/c", &["a/b"])); assert!(is_inside_or_parent_of_any("", &["a"])); assert!(is_inside_or_parent_of_any("a", &[""])); assert!(is_inside_or_parent_of_any("a", &["a", "b"])); assert!(is_inside_or_parent_of_any("a", &["b", "a"])); assert!(!is_inside_or_parent_of_any("a", &["b", "c"])); assert!(is_inside_or_parent_of_any("a/b", &["a", "b"])); assert!(is_inside_or_parent_of_any("a/b", &["b", "a"])); } #[test] fn test_inaccessible_normalized_filename() { assert_eq!( 
inaccessible_normalized_filename(Path::new("a/b")), Some((PathBuf::from("a/b"), true)) ); assert_eq!( inaccessible_normalized_filename(Path::new("a/µ")), Some((PathBuf::from("a/µ"), true)) ); } #[test] fn test_access_normalized_filename() { assert_eq!( accessible_normalized_filename(Path::new("a/b")), Some((PathBuf::from("a/b"), true)) ); assert_eq!( accessible_normalized_filename(Path::new("a/µ")), Some((PathBuf::from("a/µ"), true)) ); } bzrformats_3.4.0.orig/crates/osutils-rs/src/textfile.rs0000644000000000000000000000210515162074037020247 0ustar00use std::fs::File; use std::io::{Error, Read}; use std::path::Path; /// Return false if the supplied lines contain NULs. /// /// Only the first 1024 characters are checked. pub fn check_text_lines(lines: I) -> bool where I: IntoIterator>, { let mut buffer = [0u8; 1024]; let mut offset = 0; for line in lines.into_iter() { if line.iter().any(|&c| c == 0) { return false; } if offset + line.len() > 1024 { break; } buffer[offset..offset + line.len()].copy_from_slice(&line); offset += line.len(); } if buffer[..offset].iter().any(|&c| c == 0) { return false; } true } /// Check whether the supplied path is a text, not binary file. /// /// Raise BinaryFile if a NUL occurs in the first 1024 bytes. 
pub fn check_text_path<P: AsRef<Path>>(path: P) -> Result<bool, Error> {
    let file = File::open(path)?;
    let mut buffer = Vec::new();
    let mut handle = file.take(1024);
    handle.read_to_end(&mut buffer)?;
    Ok(buffer.iter().all(|&byte| byte != 0))
}
bzrformats_3.4.0.orig/crates/osutils-rs/src/time.rs0000644000000000000000000002345415162074037017373 0ustar00
use chrono::{DateTime, FixedOffset, Local, NaiveDateTime, TimeZone, Utc};

const DEFAULT_DATE_FORMAT: &str = "%a %Y-%m-%d %H:%M:%S";

pub fn local_time_offset(t: Option<i64>) -> i64 {
    let timestamp = t.unwrap_or_else(|| Utc::now().timestamp());
    let local_time: DateTime<Local> = Utc
        .timestamp_opt(timestamp, 0)
        .unwrap()
        .with_timezone(&Local);
    let utc_time: DateTime<Utc> = Utc.timestamp_opt(timestamp, 0).unwrap();
    // Compare the local wall-clock time with the UTC wall-clock time.
    // Calling naive_utc() on both values would always yield a zero offset.
    let local_naive_datetime = local_time.naive_local();
    let utc_naive_datetime = utc_time.naive_utc();
    let offset = local_naive_datetime - utc_naive_datetime;
    offset.num_seconds()
}

pub fn format_local_date(
    t: i64,
    offset: Option<i32>,
    timezone: Timezone,
    date_fmt: Option<&str>,
    show_offset: bool,
) -> String {
    let offset = offset.unwrap_or(0);
    let tz: FixedOffset = match timezone {
        Timezone::Utc => FixedOffset::east_opt(0).unwrap(),
        Timezone::Original => FixedOffset::east_opt(offset).unwrap(),
        Timezone::Local => *Local::now().offset(),
    };
    let dt: DateTime<FixedOffset> = tz.timestamp_opt(t, 0).unwrap();
    let date_fmt = date_fmt.unwrap_or("%c");
    let date_str = dt.format(date_fmt).to_string();
    let offset_str = if show_offset {
        let offset_fmt = if offset < 0 { "%z" } else { "%:z" };
        dt.format(offset_fmt).to_string()
    } else {
        "".to_string()
    };
    date_str + &offset_str
}

pub enum Timezone {
    Local,
    Utc,
    Original,
}

impl Timezone {
    pub fn from(s: &str) -> Option<Self> {
        match s {
            "local" => Some(Timezone::Local),
            "utc" => Some(Timezone::Utc),
            "original" => Some(Timezone::Original),
            _ => None,
        }
    }
}

pub fn format_delta(delta: i64) -> String {
    let mut delta = delta;
    let direction: &str;
    if delta >= 0 {
        direction = "ago";
    } else {
        direction = "in the future";
        delta = -delta;
    }

    let seconds =
delta;
    if seconds < 90 {
        if seconds == 1 {
            return format!("{} second {}", seconds, direction);
        } else {
            return format!("{} seconds {}", seconds, direction);
        }
    }

    let mut minutes = seconds / 60;
    let seconds = seconds % 60;
    let plural_seconds = if seconds == 1 { "" } else { "s" };
    if minutes < 90 {
        if minutes == 1 {
            return format!(
                "{} minute, {} second{} {}",
                minutes, seconds, plural_seconds, direction
            );
        } else {
            return format!(
                "{} minutes, {} second{} {}",
                minutes, seconds, plural_seconds, direction
            );
        }
    }

    let hours = minutes / 60;
    minutes %= 60;
    let plural_minutes = if minutes == 1 { "" } else { "s" };
    if hours == 1 {
        format!(
            "{} hour, {} minute{} {}",
            hours, minutes, plural_minutes, direction
        )
    } else {
        format!(
            "{} hours, {} minute{} {}",
            hours, minutes, plural_minutes, direction
        )
    }
}

pub fn format_date_with_offset_in_original_timezone(t: i64, offset: i64) -> String {
    // Format the sign separately from the magnitude so that negative offsets
    // with a minute component (e.g. -0530) do not produce a stray minus sign.
    let sign = if offset >= 0 { '+' } else { '-' };
    let offset_hours = offset.abs() / 3600;
    let offset_minutes = (offset.abs() / 60) % 60;
    let dt = Utc.timestamp_opt(t + offset, 0).unwrap();
    let date_str = dt.format(DEFAULT_DATE_FORMAT).to_string();
    let offset_str = format!(" {}{:02}{:02}", sign, offset_hours, offset_minutes);
    date_str + &offset_str
}

pub fn format_date(
    t: i64,
    offset: Option<i64>,
    timezone: Timezone,
    date_fmt: Option<&str>,
    show_offset: bool,
) -> String {
    let (dt, offset_str) = match timezone {
        Timezone::Utc => (
            DateTime::from_timestamp(t, 0).expect("timestamp should be valid"),
            if show_offset {
                " +0000".to_owned()
            } else {
                "".to_owned()
            },
        ),
        Timezone::Original => {
            let offset = offset.unwrap_or(0);
            let offset_str = if show_offset {
                let sign = if offset >= 0 { '+' } else { '-' };
                let hours = offset.abs() / 3600;
                let minutes = (offset.abs() / 60) % 60;
                format!(" {}{:02}{:02}", sign, hours, minutes)
            } else {
                "".to_owned()
            };
            (
                DateTime::from_timestamp(t + offset, 0).expect("timestamp should be valid"),
                offset_str,
            )
        }
        Timezone::Local => {
            let local = Local.timestamp_opt(t, 0).unwrap();
            let offset = local.offset().local_minus_utc();
            let offset_str =
                if show_offset {
                let sign = if offset >= 0 { '+' } else { '-' };
                let hours = offset.abs() / 3600;
                let minutes = (offset.abs() / 60) % 60;
                format!(" {}{:02}{:02}", sign, hours, minutes)
            } else {
                "".to_owned()
            };
            // Shift the timestamp by the local offset, as the Original branch
            // does, so the formatted wall-clock time is local rather than UTC.
            (
                DateTime::from_timestamp(t + i64::from(offset), 0)
                    .expect("timestamp should be valid"),
                offset_str,
            )
        }
    };
    dt.format(date_fmt.unwrap_or(DEFAULT_DATE_FORMAT))
        .to_string()
        + &offset_str
}

pub fn format_highres_date(t: f64, offset: Option<i32>) -> String {
    let offset = offset.unwrap_or(0);
    let datetime = Utc.timestamp_opt(t as i64 + offset as i64, 0).unwrap();
    let highres_seconds = format!("{:.9}", t - t.floor())[1..].to_string();
    // Format the sign separately from the magnitude so that negative offsets
    // with a minute component (e.g. -0530) are rendered correctly.
    let sign = if offset >= 0 { '+' } else { '-' };
    let offset_str = format!(
        " {}{:02}{:02}",
        sign,
        offset.abs() / 3600,
        (offset.abs() / 60) % 60
    );
    format!(
        "{}{}{}",
        datetime.format(DEFAULT_DATE_FORMAT),
        highres_seconds,
        offset_str
    )
}

const WEEKDAYS: [&str; 7] = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"];

pub fn unpack_highres_date(date: &str) -> Result<(f64, i32), String> {
    let space_loc = date.find(' ');
    if space_loc.is_none() {
        return Err(format!(
            "date string does not contain a day of week: {}",
            date
        ));
    }
    let weekday = &date[..space_loc.unwrap()];
    if !WEEKDAYS.iter().any(|&d| d == weekday) {
        return Err(format!(
            "date string does not contain a valid day of week: {}",
            date
        ));
    }
    let dot_loc = date.find('.');
    if dot_loc.is_none() {
        return Err(format!(
            "Date string does not contain high-precision seconds: {}",
            date
        ));
    }
    let base_time_str = &date[space_loc.unwrap() + 1..dot_loc.unwrap()];
    let offset_loc = date[dot_loc.unwrap()..].find(' ');
    if offset_loc.is_none() {
        return Err(format!("Date string does not contain a timezone: {}", date));
    }
    let fract_seconds_str = &date[dot_loc.unwrap()..dot_loc.unwrap() + offset_loc.unwrap()];
    let offset_str = &date[dot_loc.unwrap() + 1 + offset_loc.unwrap()..];

    let base_time = NaiveDateTime::parse_from_str(base_time_str, "%Y-%m-%d %H:%M:%S")
        .map_err(|e| format!("Failed to parse datetime string ({}): {}", base_time_str, e))?
.and_utc(); let fract_seconds = fract_seconds_str.parse::<f64>().map_err(|e| { format!( "Failed to parse high-precision seconds({}) : {}", fract_seconds_str, e ) })?; let offset = offset_str .parse::<i32>() .map_err(|e| format!("Failed to parse offset ({}): {}", offset_str, e))?; let offset_hours = offset / 100; let offset_minutes = offset % 100; let seconds_offset = (offset_hours * 3600) + (offset_minutes * 60); let timestamp = base_time.timestamp() - seconds_offset as i64; let timestamp_with_fract_seconds = timestamp as f64 + fract_seconds; Ok((timestamp_with_fract_seconds, seconds_offset)) } pub fn compact_date(when: u64) -> String { let system_time = Utc.timestamp_opt(when as i64, 0).unwrap(); let date_time: DateTime<Utc> = system_time; date_time.format("%Y%m%d%H%M%S").to_string() } #[cfg(test)] mod tests { /// Assert osutils.format_delta formats as expected. fn assert_formatted_delta(expected: &str, seconds: i64) { let actual = super::format_delta(seconds); assert_eq!(expected, actual); } #[test] fn test_format_delta() { assert_formatted_delta("0 seconds ago", 0); assert_formatted_delta("1 second ago", 1); assert_formatted_delta("10 seconds ago", 10); assert_formatted_delta("59 seconds ago", 59); assert_formatted_delta("89 seconds ago", 89); assert_formatted_delta("1 minute, 30 seconds ago", 90); assert_formatted_delta("3 minutes, 0 seconds ago", 180); assert_formatted_delta("3 minutes, 1 second ago", 181); assert_formatted_delta("10 minutes, 15 seconds ago", 615); assert_formatted_delta("30 minutes, 59 seconds ago", 1859); assert_formatted_delta("31 minutes, 0 seconds ago", 1860); assert_formatted_delta("60 minutes, 0 seconds ago", 3600); assert_formatted_delta("89 minutes, 59 seconds ago", 5399); assert_formatted_delta("1 hour, 30 minutes ago", 5400); assert_formatted_delta("2 hours, 30 minutes ago", 9017); assert_formatted_delta("10 hours, 0 minutes ago", 36000); assert_formatted_delta("24 hours, 0 minutes ago", 86400); assert_formatted_delta("35 hours, 59 minutes ago", 
129599); assert_formatted_delta("36 hours, 0 minutes ago", 129600); assert_formatted_delta("36 hours, 0 minutes ago", 129601); assert_formatted_delta("36 hours, 1 minute ago", 129660); assert_formatted_delta("36 hours, 1 minute ago", 129661); assert_formatted_delta("84 hours, 10 minutes ago", 303002); // We handle when time steps the wrong direction because computers // don't have synchronized clocks. assert_formatted_delta("84 hours, 10 minutes in the future", -303002); assert_formatted_delta("1 second in the future", -1); assert_formatted_delta("2 seconds in the future", -2); } } bzrformats_3.4.0.orig/doc/btree_index_prefetch.txt0000644000000000000000000003373115162203117017341 0ustar00==================== BTree Index Prefetch ==================== This document outlines how we decide to pre-read extra nodes in the btree index. Rationale ========= Because of the latency involved in making a request, it is often better to make fewer large requests, rather than more small requests, even if some of the extra data will be wasted. Example ------- Using my connection as an example, I have a max bandwidth of 160kB/s, and a latency of between 100-400ms to London, I'll use 200ms for this example. With this connection, in 200ms you can download 32kB. So if you make 10 requests for 4kB of data, you spend 10*.2s = 2s sending the requests, and 4*10/160 = .25s actually downloading the data. If, instead, you made 3 requests for 32kB of data each, you would take 3*.2s = .6s for requests, and 32*3/160 = .6s for downloading the data. So you save 2.25 - 1.2 = 1.05s even though you downloaded 32*3-4*10 = 56kB of data that you probably don't need. On the other hand, if you made 1 request for 480kB, you would take .2s for the request, and 480/160=3s for the data. So you end up taking 3.2s, because of the wasted 440kB. BTree Structure =============== This is meant to give a basic feeling for how the btree index is laid out on disk, not give a rigorous discussion. 
For that look elsewhere[ref?]. The basic structure is that we have pages of 4kB. Each page is either a leaf, which holds the final information we are interested in, or is an internal node, which contains a list of references to the next layer of nodes. The layers are structured such that all nodes for the top layer come first, then the nodes for the next layer, linearly in the file. Example 1 layer --------------- In the simplest example, all the data fits into a single page, the root node. This means the root node is a leaf node. Example 2 layer --------------- As soon as the data cannot fit in a single node, we create a new internal node, make that the root, and start to create multiple leaf nodes. The root node then contains the keys which divide the leaf pages. (So if leaf node 1 ends with 'foo' and leaf node 2 starts with 'foz', the root node would hold the key 'foz' at position 0). Example 3 layer --------------- It is possible for enough leaf nodes to be created that we cannot fit all their references in a single node. In this case, we again split, creating another layer, and setting that as the root. This layer then references the intermediate layer, which references the final leaf nodes. In all cases, the root node is a single page wide. The next layer can have 2-N nodes. Current Info ------------ Empirically, we've found that the number of references that can be stored on a page varies from about 60 to about 180, depending on how much we compress, and how similar the keys are. Internal nodes also achieve approximately the same compression, though they seem to be closer to 80-100 and not as variable. For most of this discussion, we will assume each page holds 100 entries, as that makes the math nice and clean. So the idea is that if you have <100 keys, they will probably all fit on the root page. If you have 100 - 10,000 keys, we will have a 2-layer structure, if you have 10,000 - 1,000,000 keys, you will have a 3-layer structure. 
10^6-10^8 will be 4-layer, etc. Data and Request ================ It is important to be aware of what sort of data requests will be made on these indexes, so that we know how to optimize them. This is still a work in progress, but generally we are searching through ancestry. The final information (in the leaf nodes) is stored in sorted order. Revision ids are generally of the form "prefix:committer@email-timestamp-randomtail". This means that revisions made by the same person around the same time will be clustered, but revisions made by different people at the same time will not be clustered. For files, the keys are ``(file-id, revision-id)`` tuples. And file-ids are generally ``basename-timestamp-random-count`` (depending on the converter). This means that all revisions for a given file-id will be grouped together, and that files with similar names will be grouped together. However, files committed in the same revisions will not be grouped together in the index.[1]_ .. [1] One interesting possibility would be to change file-ids from being 'basename-...', to being 'containing-dirname-filename-...', which would group files in the similarly named directories together. In general, we always start with a request for the root node of the index, as it tells us the final structure of the rest of the index. How many total pages, what pages are internal nodes and what layer, which ones are leaves. Before this point, we do know the *size* of the index, because that is stored in the ``pack-names`` file. Thoughts on expansion ===================== This is just a bullet list of things to consider when expanding a request. * We generally assume locality of reference. So if we are currently reading page 10, we are more likely to read page 9 or 11 than we are page 20. * However, locality of reference only really holds within a layer. If we are reading the last node in a layer, we are unlikely to read the first node of the next layer. 
In fact, we are most likely to read the *last* node of the next layer. More directly, we are probably equally likely to read any of the nodes in the next layer, which could be referred to by this layer. So if we have a structure of 1 root node, 100 intermediate nodes, and 10,000 leaf nodes, they will have offsets: 0, 1-100, 101-10,100. If we read the root node, we are likely to want any of the 1-100 nodes (because we don't know where the key points). If we are reading node 90, then we are likely to want a node somewhere around 9,000-9,100. * When expanding a request, we are considering that we probably want to read on the order of 10 pages extra. (64kB / 4kB = 16 pages.) It is unlikely that we want to expand the requests by 100. * At the moment, we assume that we don't have an idea of where in the next layer the keys might fall. We *could* use a predictive algorithm assuming homogeneous distribution. When reading the root node, we could assume an even distribution from 'a-z', so that a key starting with 'a' would tend to fall in the first few pages of the next layer, while a key starting with 'z' would fall at the end of the next layer. However, this is quite likely to fail in many ways. Specific examples: * Converters tend to use an identical prefix. So all revisions will start with 'xxx:', leading us to think that the keys fall in the last half, when in reality they fall evenly distributed. * When looking in text indexes. In the short term, changes tend to be clustered around a small set of files. Short term changes are unlikely to cross many pages, but it is unclear what happens in the mid-term. Obviously in the long term, changes have happened to all files. One possibility would be to use this after reading the root node, and then use an algorithm that compares the keys before and after this record, to find what a distribution would be, and estimate the next pages. This is a lot of work for a potentially small benefit, though. 
* When checking for N keys, we do sequential lookups in each layer. So we look at layer 1 for all N keys, then in layer 2 for all N keys, etc. So our requests will be clustered by layer. * For projects with large history, we are probably more likely to end up with a bi-modal distribution of pack files. Where we have 1 pack file with a large index, and then several pack files with small indexes, several with tiny indexes, but no pack files with medium sized indexes. This is because a command like ``bzr pack`` will combine everything into a single large file. Commands like ``bzr commit`` will create an index with a single new record, though these will be packaged together by autopack. Commands like ``bzr push`` and ``bzr pull`` will create indexes with more records, but these are unlikely to be a significant portion of the history. Consider bzr has 20,000 revisions, a single push/pull is likely to only be 100-200 revisions, or 1% of the history. Note that there will always be cases where things are evenly distributed, but we probably shouldn't *optimize* for that case. * 64kB is 16 pages. 16 pages is approximately 1,600 keys. * We are considering an index with 1 million keys to be very large. 10M is probably possible, and maybe 100M, but something like 1 billion keys is unlikely. So a 3-layer index is fairly common (it exists already in bzr), but a 4-layer is going to be quite rare, and we will probably never see a 5-layer. * There are times when the second layer is going to be incompletely filled out. Consider an index with 101 keys. We found that we couldn't fit everything into a single page, so we expanded the btree into a root page and a leaf page, and started a new leaf page. However, the root node only has a single entry. There are 3 pages, but only one of them is "full". This happens again when we get near the 10,000 node barrier. We found we couldn't fit the index in a single page, so we split it into a higher layer, and 1 more sub-layer. 
So we have 1 root node, 2 layer-2 nodes, and N leaf nodes (layer 3). If we read the first 3 nodes, we will have read all internal nodes. It is certainly possible to detect this for the first-split case (when things no longer fit into just the root node), as there will only be a few nodes total. Is it possible to detect this from only the 'size' information for the second-split case (when the index no longer fits in a single page, but still fits in only a small handful of pages)? This only really works for the root + layer 2. For layers 3+ they will always be too big to read all at once. However, until we've read the root, we don't know the layout, so all we have to go on is the size of the index, though that also gives us the explicit total number of pages. So it doesn't help to read the root page and then decide. However, on the flip side, if we read *before* the split, then we don't gain much, as we are reading pages we aren't likely to be interested in. For example: We have 10,000 keys, which fit onto 100 leaf pages, with a single root node. At 10,100 keys, it would be 101 leaf pages, which would then cause us to need 2 index pages, triggering an extra layer. However, this is very sensitive to the number of keys we fit per-page, which depends on the compression. Although, we could consider 20,000 keys, which would be 200 leaf nodes, 2 intermediate nodes, and a single root node. It is unlikely that we would ever be able to fit 200 references into a single root node. So if we pretend that we split at 1 page, 100 pages, and 10,000 pages. We might be able to say, at 1-5 pages, read all pages, for 5-100 pages, read only the root. At 100 - 500 pages, read 1-5 pages, for 500-10,000 read only the root. At 10,000-50,000 read 1-5 pages again, but above 50,000 read only the root. We could bias this a bit smaller, say at powers of 80, instead of powers of 100, etc. The basic idea is that if we are *close* to a layer split, go ahead and read a small number of extra pages. 
* The previous discussion applies whenever we have an upper layer that is not completely full. So the pages referenced by the last node from the upper layer will often not have a full 100-way fan out. Probably not worthwhile very often, though. * Sometimes we will be making a very small request for a very small number of keys, we don't really want to bloat tiny requests. Hopefully we can find a decent heuristic to determine when we will be wanting extra nodes later, versus when we expect to find all we want right now. Algorithm ========= This is the basic outline of the algorithm. 1. If we don't know the size of the index, don't expand as we don't know what is available. (This only really applies to the pack-names file, which is unlikely to ever become larger than 1 page anyway.) 2. If a request is already wide enough to be greater than the number of recommended pages, don't bother trying to expand. This only really happens with LocalTransport which recommends a single page. 3. Determine what pages have already been read (if any). If the pages left to read can fit in a single request, just request them. This tends to happen on medium sized indexes (ones with low hundreds of revisions), and near the end when we've read most of the whole index already. 4. If we haven't read the root node yet, and we can't fit the whole index into a single request, only read the root node. We don't know where the layer boundaries are anyway. 5. If we haven't read "tree depth" pages yet, and are only requesting a single new page don't expand. This is meant to handle the 'lookup 1 item in the index' case. In a large pack file, you'll read only a single page at each layer and then be done. When spidering out in a search, this will cause us to take a little bit longer to start expanding, but once we've started we'll be expanding at full velocity. 
This could be improved by having indexes inform each other that they have already entered the 'search' phase, or by having a hint from above to indicate the same. However, remember the 'bi-modal' distribution. Most indexes will either be very small, or very large. So either we'll read the whole thing quickly, or we'll end up spending a lot of time in the index. Which makes a small number of extra round trips to large indexes a small overhead. For 2-layer nodes, this only 'wastes' one round trip. 6. Now we are ready to expand the requests. Expand by looking for more pages next to the ones requested that fit within the current layer. If you run into a cached page, or a layer boundary, search further only in the opposite direction. This gives us proper locality of reference, and also helps because when a search goes in a single direction, we will continue to prefetch pages in that direction. .. vim: ft=rst tw=79 ai bzrformats_3.4.0.orig/doc/bundle-format4.txt0000644000000000000000000002554015162203117016013 0ustar00============================================ Merge Directive format 2 and Bundle format 4 ============================================ :Date: 2007-06-21 Motivation ---------- Merge Directive format 2 represents a request to perform a certain merge. It provides access to all the data necessary to perform that merge, by including a branch URL or a bundle payload. It typically will include a preview of what applying the patch would do. Bundle Format 4 is designed to be a compact format for storing revision metadata that can be generated quickly and installed into a repository efficiently. It is not intended to be human-readable. Note ---- These two formats, taken together, can be viewed as the successor of Bundle format 0.9, so their specifications are combined. It is expected that in the future, bundle and merge-directive formats will vary independently. Bundle Format Name ------------------ This is the fourth bundle format to see public use. 
Previous versions were 0.7, 0.8, and 0.9. Only 0.7's version number was aligned with a Bazaar release. Dependencies ------------ - Container format 1 - Multiparent diffs - Bencode - Patch-RIO Description ----------- Merge Directives fulfil the role previous bundle formats had of requesting a merge to be performed, but are a more flexible way of doing so. With the introduction of these two formats, there is a clear split between "directive", which is a request to merge (and therefore signable), and "bundle", which is just data. Merge Directive format 2 may provide a patch preview of the change being requested. If a preview is supplied, the receiving client will verify that the actual change matches the preview. Merge Directive format 2 also includes a testament hash, to ensure that if a branch is used, the branch cannot be subverted to cause the wrong changes to be applied. Bundle format 4 is designed to trade human-readability for speed and compactness. It does not contain a human-readable "prelude" patch. Merge Directive 2 Contents -------------------------- This format consists of three sections, in the following order. Patch-RIO command section ~~~~~~~~~~~~~~~~~~~~~~~~~ This section is identical to the corresponding section in Format 1 merge directives, except as noted below. It is mandatory. It is terminated by a line reading ``#`` that is not preceded by a line ending with ``\``. In order to support cherry-picking and patch comparison, this format adds a new piece of information, the ``base_revision_id``. This is a suggested base revision for merging. It may be supplied by the user. If not, it is calculated using the standard merge base algorithm, with the ``revision_id`` and target branch's ``last_revision`` as its inputs. When merging, clients should use the ``base_revision_id`` when it is not already present in the ancestry of the ``last_revision`` of the target branch. If it is already present, clients should calculate a merge base in the normal way. 
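The base selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ``choose_merge_base``, ``target_ancestry`` and ``find_merge_base`` are hypothetical names, not part of any Bazaar API, and the ancestry is modelled as a plain set of revision ids.

```python
def choose_merge_base(base_revision_id, target_ancestry, find_merge_base):
    """Pick the base revision for a merge, following the rule described above.

    base_revision_id: the suggested base from the merge directive.
    target_ancestry: set of revision ids in the ancestry of the target
        branch's last_revision (hypothetical representation).
    find_merge_base: zero-argument callable that computes a merge base
        "in the normal way" (hypothetical hook).
    """
    if base_revision_id not in target_ancestry:
        # The suggested base is new to the target, so honour it
        # (this is the cherry-pick friendly path).
        return base_revision_id
    # The suggested base is already merged; fall back to the usual algorithm.
    return find_merge_base()
```

With this shape, a directive whose ``base_revision_id`` is unknown to the target keeps that base, while a directive whose base is already in the target's ancestry defers to the ordinary merge base calculation.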
Patch preview section ~~~~~~~~~~~~~~~~~~~~~ This section is optional. It begins with the line ``# Begin patch``. It is terminated by the end-of-file or by the beginning of a bundle section. Its contents are a unified diff, as per the ``bzr diff`` command. The FROM revision is the ``base_revision_id`` specified in the Patch-RIO section. Bundle section ~~~~~~~~~~~~~~ This section is optional, but if it is not supplied, a source_branch must be supplied. It begins with the line ``# Begin bundle``, and is terminated by the end-of-file. The contents are a base-64 encoded bundle. This may be any bundle format, but formats 4+ are strongly recommended. The base revision is the newest revision in the source branch which is an ancestor of all revisions not present in target which are ancestors of revision_id. This base revision may or may not be the same as the ``base_revision_id``. In particular, the ``base_revision_id`` may specify a cherry-pick, but all the ancestors of the ``base_revision_id`` should be installed in the target repository before performing such a merge. Bundle 4 Contents ----------------- Bazaar revision bundles begin with a format marker that reads ``# Bazaar revision bundle v4`` in plaintext. The remainder of the file is a ``Bazaar pack format 1`` container. The container is compressed using bzip2. Putting the format marker in plaintext ensures that old clients will give good diagnostics, but renders the file unreadable by standard bzip2 utilities. Serialization ~~~~~~~~~~~~~ Format 4 records revision and inventory records in their repository serialization format. This minimizes translation and compression costs in the common case, where the sender and receiver use the same serialization format for their repository. Steps have been taken to ensure a faithful conversion when serialization formats are mismatched. Bundle Records ~~~~~~~~~~~~~~ The bundle format creates a single bundle-level record out of two container records. 
The first container record contains metainfo as a Bencoded dict. The second container record contains the body. The bundle record name is associated with the metainfo record. The body record is anonymous. Record metainfo ~~~~~~~~~~~~~~~ :record_kind: The storage strategy of the record. May be ``fulltext`` (the record body contains the full text of the value), ``mpdiff`` (the record body contains a multi-parent diff of the value), or ``header`` (no record body). :parents: Used in fulltext and mpdiff records. The revisions that should be noted as parents of this revision in the repository. For mpdiffs, this is also the list of build-parents. :sha1: Used in mpdiff records. The sha-1 hash of the full-text value. Bundle record naming ~~~~~~~~~~~~~~~~~~~~~ All bundle records have a single name, which is associated with the metainfo container record. Records are named according to the body's content-kind, revision-id, and file-id. Content-kind may be one of: :file: a version of a user file :inventory: the tree inventory :revision: the revision metadata for a revision :signature: the revision signature for a revision Names are constructed like so: ``content-kind/revision-id/file-id``. Values are interpreted left-to-right, so if two values are present, they are content-kind and revision-id. A record has a file-id if-and-only-if it is a file record. Info records have no revision or file-id. Inventory, revision and signature all have content-kind and revision-id, but no file-id. Layout ~~~~~~ The first record is an info/header record. The subsequent records are mpdiff file records. They are ordered first by file id, then in topological order by revision-id. The next records are mpdiff inventory records. They are topologically sorted. The next records are revision and signature fulltexts. They are interleaved and topologically sorted. Info record ~~~~~~~~~~~ The info record has type ``header``. It has no revision_id or file_id. 
Its metadata contains: :serializer: A string describing the serialization format used for inventory and revision data. May be ``xml5``, ``xml6`` or ``xml7``. :supports_rich_root: 1 if the source repository supports rich roots, 0 otherwise. Implementation notes ~~~~~~~~~~~~~~~~~~~~ - knit deltas contain almost enough information to extract the original SequenceMatcher.get_matching_blocks() call used to produce them. Combining that information with the relevant fulltexts allows us to avoid performing sequence matching on any fulltexts for which we have deltas. - MultiParent deltas contain ``get_matching_blocks`` output almost verbatim, but if there is more than one parent, the information about the leftmost parent may be incomplete. However, for single-parent multiparent diffs, we can extract the ``SequenceMatcher.get_matching_blocks`` output, and therefore the ``SequenceMatcher.get_opcodes`` output used to create knit deltas. Installing data across serialization mismatches ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In practice, there cannot be revision serialization mismatches, because the serialization of revisions has been consistent in serializations 5-7. If there is a mismatch in inventory serialization formats, the receiver can: 1. extract the inventory objects for the parents 2. serialize them using the bundle serializer 3. apply the mpdiff 4. calculate the fulltext sha1 5. compare the calculated sha1 to the expected sha1 6. deserialize using the bundle serializer 7. serialize using the repository serializer 8. add to the repository This is much slower, of course. But since the fulltext is verified at step 5, it should be just as safe as any other conversion. Model differences ~~~~~~~~~~~~~~~~~ Note that there may be model differences requiring additional changes. These differences are described by the "supports_rich_root" value in the info record. A subset of xml6 and xml7 records are compatible with xml5 (i.e. 
those that were converted from xml5 originally). When installing from a bundle whose serializer supports tree references to a repository that does not support tree references, clients should halt if they encounter a record containing a tree reference. When installing from a supports_rich_root bundle to a repository that does not support rich roots, clients should halt if they encounter an inventory record whose root directory revision-id does not match the inventory revision id. When installing from a bundle that does not support rich roots to a repository that does, additional knits should be added for the root directory, with a revision for each inventory revision. Validating preview patches ~~~~~~~~~~~~~~~~~~~~~~~~~~ When applying a merge directive that includes a preview, clients should verify that the preview matches the changes requested by the merge directive. In order to do this, the client should generate a diff from the ``base_revision_id`` to the ``revision_id``. This diff should be compared against the preview patch, making allowances for the fact that whitespace munging may have occurred. One form of whitespace munging that has been observed is line-ending conversion. Certain mail clients such as Evolution do not respect the line-endings of text attachments. Since line-ending conversion is unlikely to alter the meaning of a patch, it seems safe to ignore line endings when comparing the preview patch. Another form of whitespace munging that has been observed is trailing-whitespace stripping. Again, it seems unlikely that stripping trailing whitespace could alter the meaning of a patch. Such a distinction is also invisible to readers, so ignoring it does not create a new threat. So it seems reasonable to ignore trailing whitespace when comparing the patches. Other mungings are possible, but it is recommended not to implement support for them until they have been observed. 
Each of these changes makes the comparison more approximate, and the more approximate it becomes, the easier it is to provide a preview patch that does not match the requested changes. bzrformats_3.4.0.orig/doc/bundles.txt0000644000000000000000000000550015162203117014616 0ustar00======= Bundles ======= Status ====== :Date: 2007-06-19 This document describes the current and future design of the bzr bundle facility. .. contents:: Motivation ========== Bundles are intended to be a compact binary representation of the changes done within a branch for transmission between users. Bundles should be able to be used easily and seamlessly - we want to avoid having a parallel set of commands to get data from within a bundle. A related concept is **merge directives** which are used to transmit bzr merge and merge-like operations from one user to another in such a way that the recipient can be sure they get the correct data the initiator desired. Desired features ================ * A bundle should be able to substitute for the entire branch in any bzr command that operates on branches in a read only fashion. * Bundles should be as small as possible without losing data to keep them feasible for including in emails. Historical Design ================= Not formally documented, the current released implementation can be found in bzrlib.bundle.serializer. One key element is that this design included parts of the branch data as human readable diffs; which were then subject to corruption by transports such as email. June 2007 Design ================ `Bundle Format 4 spec`_ .. _Bundle Format 4 spec: bundle-format4.html Future Plans ============ Bundles will be implemented as a 'Shallow Branch' with the branch and repository data combined into a single file. This removes the need to special case bundle handling for all commands which read from branches. Physical encoding ----------------- Bundles will be encoded using the bzr pack format. 
Within the pack the branch metadata will be serialised as a BzrMetaDir1 branch entry. The Repository data added by the revisions contained in the bundle will be encoded using multi parent diffs as they are the most pithy diffs we are able to create today in the presence of merges. XXX More details needed? Code reuse ---------- Ideally we can reuse our BzrMetaDir based branch formats directly within a Bundle by layering a Transport interface on top of the pack - or just copying the data out into a readonly memory transport when we read the pack. This suggests we will have a pack specific Control instance, replacing the usual 'BzrDir' instance, but use the Branch class as-is. For the Repository access, we will create a composite Repository using the planned Repository Stacking API, and a minimal Repository implementation that can work with the multi parent diffs within the bundle. We will need access to a branch that has the basis revision of the bundle to be able to construct revisions from within it - this is a requirement for Shallow Branches too, so hopefully we can define a single mechanism at the Branch level to gain access to that. .. vim: ft=rst tw=74 ai bzrformats_3.4.0.orig/doc/container-format.txt0000644000000000000000000001677515162203117016452 0ustar00================ Container format ================ Status ====== :Date: 2007-06-07 This document describes the proposed container format for streaming and storing collections of data in Bazaar. Initially this will be used for streaming revision data for incremental push/pull in the smart server for 0.18, but the intention is that this will be the basis for much more than just that use case. In particular, this document currently focuses almost exclusively on the streaming case, and not the on-disk storage case. It also does not discuss the APIs used to manipulate containers and their records. .. 
contents:: Motivation ========== To create a low-level file format which is suitable for solving the smart server latency problem and whose layout and requirements are extendable in future versions of Bazaar, and with no requirements that the smart server does not have today. Terminology =========== A **container** is a streamable file that contains a series of **records**. Records may have **names**, and consist of bytes. Use Cases ========= Here's a brief description of use cases this format is intended to support. Streaming data between a smart server and client ------------------------------------------------ It would be nice if we could combine multiple containers into a single stream by something no more expensive than concatenation (e.g. by omitting end/start marker pairs). This doesn't imply that such a combination necessarily produces a valid container (e.g. care must be taken to ensure that names are still unique in the combined container), or even a useful container. It is simply that the cost of assembling a new combined container is practically as cheap as simple concatenation. Incremental push or pull ~~~~~~~~~~~~~~~~~~~~~~~~ Consider the use case of incremental push/pull, which is currently (0.16) very slow on high-latency links due to the large number of round trips. What we'd like is something like the following. A client will make a request meaning "give me the knit contents for these revision IDs" (how the client determines which revision IDs it needs is unimportant here). In response, the server streams a single container of: * one record per file-id:revision-id knit gzip contents and graph data, * one record per inventory:revision-id knit gzip contents and graph data, * one record per revision knit gzip contents, * one record per revision signature, * end marker record. in that order. 
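As a rough illustration of the response ordering listed above, the following Python sketch sorts a batch of records into that order and appends the end marker. The kind labels (``"file"``, ``"inventory"``, ``"revision"``, ``"signature"``, ``"end"``) and the ``(kind, name)`` tuple shape are assumptions made for the example; the text above only fixes the relative order.

```python
# Hypothetical kind labels; only their relative order comes from the text above.
STREAM_ORDER = ["file", "inventory", "revision", "signature"]


def order_stream(records):
    """Return (kind, name) records in the streaming order described above.

    The sort is stable, so records of the same kind keep their input order.
    A final ("end", None) entry stands in for the end marker record.
    """
    rank = {kind: position for position, kind in enumerate(STREAM_ORDER)}
    ordered = sorted(records, key=lambda record: rank[record[0]])
    return ordered + [("end", None)]
```

A server following this sketch could collect the records it needs for the requested revision IDs in any order and still emit a stream whose layout the client can rely on.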
Persistent storage on disk -------------------------- We want a storage format that allows lock-free writes, which suggests a format that uses *rename into place*, and *do not modify after writing*. Usable before deep model changes to Bazaar ------------------------------------------ We want a format we can use and refine sooner rather than later. So it should be usable before the anticipated model changes for Bazaar "1.0" land, while not conflicting with those changes either. Specifically, we'd like to have this format in Bazaar 0.18. Examples of possible record content ----------------------------------- * full texts of file versions * deltas of full texts * revisions * inventories * inventory as tree items e.g. the inventory data for 20 files * revision signatures * per-file graph data * annotation cache Characteristics =============== Some key aspects of the described format are discussed in this section. No length-prefixing of entire container --------------------------------------- The overall container is not length-prefixed. Instead there is an end marker so that readers can determine when they have read the entire container. This also does not conflict with the goal of allowing single-pass writing. Structured as a self-contained series of records ------------------------------------------------ The container contains a series of *records*. Each record is self-delimiting. Record markers are lightweight. The overhead in terms of bytes and processing for records in this container vs. the raw contents of those records is minimal. Addressing records ------------------ There is a requirement that each object can be given an arbitrary name. Some version control systems address all content by the SHA-1 digest of that content, but this scheme is unsatisfactory for Bazaar's revision objects. We can still allow addressing by SHA-1 digest for those content types where it makes sense. Some proposed object names: * to name a revision: "``revision:``\ *revision-id*". 
e.g., `revision:pqm@pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5`. * to name an inventory delta: "``inventory.delta:``\ *revision-id*". e.g., `inventory.delta:pqm@pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5`. It seems likely that we may want to have multiple names for an object. This format allows that (by allowing multiple ``name`` headers in a Bytes record). Although records are in principle addressable by name, this specification alone doesn't provide for efficient access to a particular record given its name. It is intended that separate indexes will be maintained to provide this. It is acceptable to have records with no explicit name, if the expected use of them does not require them. For example: * a record's content could be self-describing in the context of a particular container, or * a record could be accessed via an index based on SHA-1, or * when streaming, the first record could be treated specially. Reasonably cheap for small records ---------------------------------- The overhead for storing fairly short records (tens of bytes, rather than thousands or millions) is minimal. The minimum overhead is 3 bytes plus the length of the decimal representation of the *length* value (for a record with no name). Specification ============= This describes just a basic layer for storing a simple series of "records". This layer has no intrinsic understanding of the contents of those records. The format is: * a **container lead-in**, "``Bazaar pack format 1 (introduced in 0.18)\n``", * followed by one or more **records**. A record is: * a 1 byte **kind marker**. * 0 or more bytes of record content, depending on the record type. Record types ------------ End Marker ~~~~~~~~~~ An **End Marker** record: * has a kind marker of "``E``", * no content bytes. End Marker records signal the end of a container. 
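The layout is simple enough to sketch in a few lines of Python. This is only an illustration, not the bzrformats API: the names ``bytes_record``, ``write_container`` and ``read_container`` are hypothetical, and the Bytes record layout it uses (kind marker, decimal length line, zero or more name lines, an empty line, then the body) is the one specified in the Bytes section that follows.

```python
import io

LEAD_IN = b"Bazaar pack format 1 (introduced in 0.18)\n"


def bytes_record(body, names=()):
    """Serialise one Bytes record: kind marker, length, names, blank line, body."""
    for name in names:
        # Names must be at least one character long and contain no whitespace.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("invalid record name: %r" % (name,))
    header = b"B%d\n" % len(body)
    header += b"".join(name.encode("utf-8") + b"\n" for name in names)
    return header + b"\n" + body


def write_container(records):
    """Serialise (names, body) pairs, then terminate with an End Marker."""
    out = [LEAD_IN]
    for names, body in records:
        out.append(bytes_record(body, names))
    out.append(b"E")
    return b"".join(out)


def read_container(data):
    """Yield (names, body) pairs until the End Marker record is reached."""
    stream = io.BytesIO(data)
    if stream.readline() != LEAD_IN:
        raise ValueError("not a pack format 1 container")
    while True:
        kind = stream.read(1)
        if kind == b"E":
            return
        if kind != b"B":
            raise ValueError("unknown record kind: %r" % (kind,))
        length = int(stream.readline().decode("ascii"))
        names = []
        while True:
            line = stream.readline()
            if line == b"\n":
                break  # end of headers
            if not line.endswith(b"\n"):
                raise ValueError("truncated record header")
            names.append(line[:-1].decode("utf-8"))
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("truncated record body")
        yield names, body
```

An unnamed, empty record serialises to ``B0`` plus two newlines, which is the 3 bytes of fixed overhead plus one digit of length mentioned under "Reasonably cheap for small records".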
Bytes ~~~~~ A **Bytes** record: * has a kind marker of "``B``", * followed by a mandatory **content length** [1]_: "*number*\ ``\n``", where *number* is in decimal, e.g:: 1234 * followed by zero or more optional **names**: "*name*\ ``\n``", e.g.:: revision:pqm@pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5 * followed by an **end of headers** byte: "``\n``", * followed by some **bytes**, exactly as many as specified by the length prefix header. So a Bytes record is a series of lines encoding the length and names (if any) followed by a body. For example, this is a possible Bytes record (including the kind marker):: B26 example-name1 example-name2 abcdefghijklmnopqrstuvwxyz Names ----- Names should be UTF-8 encoded strings, with no whitespace. Names should be unique within a single container, but no guarantee of uniqueness outside of the container is made by this layer. Names need to be at least one character long. .. [1] This requires that the writer of a record knows the full length of the record up front, which typically means it will need to buffer an entire record in memory. For the first version of this format this is considered to be acceptable. .. vim: ft=rst tw=74 ai bzrformats_3.4.0.orig/doc/dirstate.txt0000644000000000000000000000471215162203117015005 0ustar00Dirstate ======== Don't really need the hashes of the current versions - just knowing whether they've changed or not will generally be enough - and just the mtime and ctime of a point in time may be enough? ``_dirblock_state`` ------------------- There are currently 4 levels that state can have. 1. NOT_IN_MEMORY The actual content blocks have not been read at all. 2. IN_MEMORY_UNMODIFIED The content blocks have been read and are available for use. They have not been changed at all versus what was written on disk when we read them. 3. IN_MEMORY_HASH_MODIFIED We have updated the in-memory state, but only to record the sha1/symlink target value and the stat value that means this information is 'fresh'. 4. 
IN_MEMORY_MODIFIED We have updated an actual record. (Parent lists, added a new file, deleted something, etc.) In this state, we must always write out the dirstate, or some user action will be lost. IN_MEMORY_HASH_MODIFIED ~~~~~~~~~~~~~~~~~~~~~~~ This state is a bit special, so deserves its own topic. If we are IN_MEMORY_HASH_MODIFIED, we only write out the dirstate if enough records have been updated. The idea is that if we would save future I/O by writing an updated dirstate, then we should do so. The threshold for this is set by "worth_saving_limit". The default is that at least 10 entries must be updated in order to consider the dirstate file worth updating. Going one step further, newly added files, symlinks, and directory entries updates are treated specially. We know that we will always stat all entries in the tree so that we can observe *if* they have changed. In the case of directories, all the information we know about them is just from that stat value. There is no extra content to read. So an update directory entry doesn't cause us to update to IN_MEMORY_HASH_MODIFIED. However, if there are other modifications worth saving, we will go ahead and save the directory entry update at the same time. Similarly, symlink targets are commonly stored in the inode entry directly. So once we have stat'ed the symlink, we already have its target information in memory. The one caveat is if we used to think an object was a file, and it became a directory or symlink, then we will treat it as worth saving. In the case of newly added files, we never have to read their content to know that they are different from the basis tree. So saving the updated information also won't save a future read. .. vim: ft=rst tw=74 et bzrformats_3.4.0.orig/doc/groupcompress-design.txt0000644000000000000000000001265515162203117017352 0ustar00This document contains notes about the design for groupcompress, replacement VersionedFiles store for use in pack based repositories. 
The goal is to provide fast, history bounded text extraction. Overview ++++++++ The goal: Much tighter compression, maintained automatically. Considerations to weigh: The minimum IO to reconstruct a text with no other repository involved; The number of index lookups to plan a reconstruction. The minimum IO to reconstruct a text with another repository's assistance (affects network IO for fetch, which impacts incremental pulls and shallow branch operations). Current approach ================ Each delta is individually compressed against another text, and then entropy compressed. We index the pointers between these deltas. Solo reconstruction: Plan a readv via the index, read the deltas in forward IO, apply each delta. Total IO: sum(deltas) + deltacount*index overhead. Fetch/stacked reconstruction: Plan a readv via the index, using local basis texts where possible. Then readv locally and remotely and apply deltas. Total IO as for solo reconstruction. Things to keep ============== Reasonable sizes 'amount read' from remote machines to reconstruct an arbitrary text: Reading 5MB for a 100K plain text is not a good trade off. Reading (say) 500K is probably acceptable. Reading ~100K is ideal. However, it's likely that some texts (e.g. NEWS versions) can be stored for nearly-no space at all if we are willing to have unbounded IO. Profiling to set a good heuristic will be important. Also allowing users to choose to optimise for a server environment may make sense: paying more local IO for less compact storage may be useful. Things to remove ================ Index scatter gather IO. Doing hundreds or thousands of index lookups is very expensive, and doing that per file just adds insult to injury. Partitioned compression amongst files. Scatter gather IO when reconstructing texts: linear forward IO is better. Thoughts ======== Merges combine texts from multiple versions to create a new version. Deltas add new text to existing files and remove some text from the same. 
Getting high compression means reading some base and then a chain of deltas (could be a tree) to gain access to the thing that the final delta was made against, and that delta. Rather than composing all these deltas, we can just perform the final diff against the base text and the serialised individual deltas. If the diff algorithm can reuse out of order lines from previous texts (e.g. storing AB -> BA as pointers rather than delete and add), then any previously stored line in a single chain can be reused. One such diff algorithm is xdelta, another reasonable one to consider is plain old zlib or lzma. We could also use bzip2. One advantage of using a generic compression engine is less python code. One advantage of preprocessing line based deltas is that we reduce the window size for the text repeated within lines, and that will help compression by a simple entropy compressor as a post processor. lzma appears fantastic at compression - 420MB of NEWS files down to 200KB. So window size appears to be a key determiner for efficiency. Delta strategy ++++++++++++++ Very big objects - no delta. I plan to kick this in at 5MB initially, but once the codebase is up and running, we can tweak this to Very small objects - no delta? If they are combined with a larger zlib object why not? (Answer: because zlib's window is really small) Other objects - group by fileid (gives related texts a chance, though using a file name would be better long term as e.g. COPYING and COPYING from different projects could combine). Then by reverse topological graph (as this places more recent texts at the front of a chain). Alternatively, group by size, though that should not matter with a large enough window. Finally, delta the texts against the current output of the compressor. This is essentially a somewhat typed form of sliding window dictionary compression. An alternative implementation would be to just use zlib, or lzma, or bzip2 directly. 
Unfortunately, just using entropy compression forces a lot of data to be output by the decompressor - e.g. 420MB in the NEWS sample corpus. When we only want a single 55K text that's inefficient. (An initial test took several seconds with lzma.) The fastest to implement approach is probably just 'diff output to date and add to entropy compressor'. This should produce reasonable results. As delta chain length is not a concern (only one delta to apply ever), we can simply cap the chain when the total read size becomes unreasonable. Given older texts are smaller we probably want some weighted factor of plaintext size. In this approach, a single entropy compressed region is read as a unit, giving the lower bound for IO (and how much to read is an open question - what byte offset of compressed data is sufficient to ensure that the delta-stream contents we need are reconstructable). Flushing, while possible, degrades compression (and adds overhead - we'd be paying 4 bytes per record guaranteed). Again - tests will be needed. A nice possibility is to output mpdiff compatible records, which might enable some code reuse. This is more work than just diff (current_out, new_text), so can wait for the concept to be proven. Implementation Strategy +++++++++++++++++++++++ Bring up a VersionedFiles object that implements this, then stuff it into a repository format. zlib as a starting compressor, though bzip2 will probably do a good job. bzrformats_3.4.0.orig/doc/improved_chk_index.txt0000644000000000000000000006011515162203117017026 0ustar00=================== CHK Optimized index =================== Our current btree style index is nice as a general index, but it is not optimal for Content-Hash-Key based content. With CHK, the keys themselves are hashes, which means they are randomly distributed (similar keys do not refer to similar content), and they do not compress well. However, we can create an index which takes advantage of these properties, rather than suffering from them. 
Even further, there are specific advantages provided by ``groupcompress``, because of how individual items are clustered together. Btree indexes also rely on zlib compression, in order to get their compact size, and further has to try hard to fit things into a compressed 4k page. When the key is a sha1 hash, we would not expect to get better than 20bytes per key, which is the same size as the binary representation of the hash. This means we could write an index format that gets approximately the same on-disk size, without having the overhead of ``zlib.decompress``. Some thought would still need to be put into how to efficiently access these records from remote. Required information ==================== For a given groupcompress record, we need to know the offset and length of the compressed group in the .pack file, and the start and end of the content inside the uncompressed group. The absolute minimum is slightly less, but this is a good starting point. The other thing to consider, is that for 1M revisions and 1M files, we'll probably have 10-20M CHK pages, so we want to make sure we have an index that can scale up efficiently. 1. A compressed sha hash is 20-bytes 2. Pack files can be > 4GB, we could use an 8-byte (64-bit) pointer, or we could store a 5-byte pointer for a cap at 1TB. 8-bytes still seems like overkill, even if it is the natural next size up. 3. An individual group would never be longer than 2^32, but they will often be bigger than 2^16. 3 bytes for length (16MB) would be the minimum safe length, and may not be safe if we expand groups for large content (like ISOs). So probably 4-bytes for group length is necessary. 4. A given start offset has to fit in the group, so another 4-bytes. 5. Uncompressed length of record is based on original size, so 4-bytes is expected as well. 6. That leaves us with 20+8+4+4+4 = 40 bytes per record. At the moment, btree compression gives us closer to 38.5 bytes per record. 
We don't have perfect compression, but we also don't have >4GB pack files (and if we did, the first 4GB are all under the 2^32 barrier :). If we wanted to go back to the ''minimal'' amount of data that we would need to store: 1. 8 bytes of a sha hash are generally going to be more than enough to fully determine the entry (see `Partial hash`_). We could support some amount of collision in an index record, in exchange for resolving it inside the content. At least in theory, we don't *have* to record the whole 20-bytes for the sha1 hash. (8-bytes gives us less than 1 in 1000 chance of a single collision for 10M nodes in an index) 2. We could record the start and length of each group in a separate location, and then have each record reference the group by an 'offset'. This is because we expect to have many records in the same group (something like 10k or so, though we've fit >64k under some circumstances). At a minimum, we have one record per group so we have to store at least one reference anyway. So the maximum overhead is just the size and cost of the dereference (and normally will be much much better than that.) 3. If a group reference is an 8-byte start, and a 4-byte length, and we have 10M keys, but get at least 1k records per group, then we would have 10k groups. So we would need 120kB to record all the group offsets, and then each individual record would only need a 2-byte group number, rather than a 12-byte reference. We could be safe with a 4-byte group number, but if each group is ~1MB, 64k groups is 64GB. We can start with 2-byte, but leave room in the header info to indicate if we have more than 64k group entries. Also, current grouping creates groups of 4MB each, which would make it 256GB, to create 64k groups. And our current chk pages compress down to less than 100 bytes each (average is closer to 40 bytes), which for 256GB of raw data, would amount to 2.7 billion CHK records. 
(This will change if we start to use CHK for text records, as they do not compress down as small.) Using 100 bytes per 10M chk records, we have 1GB of compressed chk data, split into 4MB groups or 250 total groups. Still << 64k groups. Conversions could create 1 chk record at a time, creating a group for each, but they would be foolish to not commit a write group after 10k revisions (assuming 6 CHK pages each). 4. We want to know the start-and-length of a record in the decompressed stream. This could actually be moved into a mini-index inside the group itself. Initial testing showed that storing an expanded "key => start,offset" consumed a considerable amount of compressed space. (about 30% of final size was just these internal indices.) However, we could move to a pure "record 1 is at location 10-20", and then our external index would just have a single 'group entry number'. There are other internal forces that would give a natural cap of 64k entries per group. So without much loss of generality, we could probably get away with a 2-byte 'group entry' number. (which then generates an 8-byte offset + endpoint as a header in the group itself.) 5. So for 1M keys, an ideal chk+group index would be: a. 6-byte hash prefix b. 2-byte group number c. 2-byte entry in group number d. a separate lookup of 12-byte group number to offset + length e. a variable width mini-index that splits X bits of the key. (to maintain small keys, low chance of collision, this is *not* redundant with the value stored in (a)) This should then dereference into a location in the index. This should probably be a 4-byte reference. It is unlikely, but possible, to have an index >16MB. With an 10-byte entry, it only takes 1.6M chk nodes to do so. At the smallest end, this will probably be a 256-way (8-bits) fan out, at the high end it could go up to 64k-way (16-bits) or maybe even 1M-way (20-bits). (64k-way should handle up to 5-16M nodes and still allow a cheap <4k read to find the final entry.) 
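As a rough cross-check of the layout in item 5, the sizes can simply be added up. This is a sketch only: the constant and function names are made up for illustration, and it assumes "M" means 2^20 and "k" means 2^10, which is the reading under which the figures in this document come out as round numbers.

```python
ENTRY_BYTES = 6 + 2 + 2   # hash prefix + group number + entry-in-group number
GROUP_BYTES = 8 + 4       # group start offset + group length
MINI_INDEX_BYTES = 4      # one pointer into the entry table per fan-out slot


def index_size(entries, groups, fan_out_bits):
    """Total index size in bytes for the fixed-width layout sketched above."""
    slots = 1 << fan_out_bits
    return (entries * ENTRY_BYTES
            + groups * GROUP_BYTES
            + slots * MINI_INDEX_BYTES)


def candidates_per_slot(entries, fan_out_bits):
    """Average number of entries sharing one mini-index prefix."""
    return entries / (1 << fan_out_bits)


MIB = 2 ** 20
size = index_size(entries=10 * MIB, groups=2 ** 16, fan_out_bits=16)
print(size / MIB)                          # total size in MiB
print(candidates_per_slot(10 * MIB, 16))   # entries to scan after the mini-index
```

The entries dominate: with 10M of them the group table and the mini-index together contribute 1 MiB.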
So the max size for the optimal groupcompress+chk index with 10M entries would be:: 10 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 101 MiB So 101MiB which breaks down as 100MiB for the actual entries, 0.75MiB for the group records, and 0.25MiB for the mini index. 1. Looking up a key would involve: a. Read ``XX`` bytes to get the header, and various config for the index. Such as length of the group records, length of mini index, etc. b. Find the offset in the mini index for the first YY bits of the key. Read the 4 byte pointer stored at that location (which may already be in the first content if we pre-read a minimum size.) c. Jump to the location indicated, and read enough bytes to find the correct 12-byte record. The mini-index only indicates the start of records that start with the given prefix. A 64k-way index resolves 10MB records down to 160 possibilities. So at 12 bytes each, to read all would cost 1920 bytes to be read. d. Determine the offset for the group entry, which is the known ``start of groups`` location + 12B*offset number. Read its 12-byte record. e. Switch to the .pack file, and read the group header to determine where in the stream the given record exists. At this point, you have enough information to read the entire group block. For local ops, you could only read enough to get the header, and then only read enough to decompress just the content you want to get at. Using an offset, you also don't need to decode the entire group header. If we assume that things are stored in fixed-size records, you can jump to exactly the entry that you care about, and read its 8-byte (start,length in uncompressed) info. If we wanted more redundancy we could store the 20-byte hash, but the content can verify itself. f. If the size of these mini headers becomes critical (8 bytes per record is 8% overhead for 100 byte records), we could also compress this mini header. 
Changing the number of bytes per entry is unlikely to be efficient, because groups standardize on 4MiB wide, which is >>64KiB for a 2-byte offset, 3-bytes would be enough as long as we never store an ISO as a single entry in the content. Variable width also isn't a big win, since base-128 hits 4-bytes at just 2MiB. For minimum size without compression, we could only store the 4-byte length of each node. Then to compute the offset, you have to sum all previous nodes. We require <64k nodes in a group, so it is up to 256KiB for this header, but we would lose partial reads. This should still be cheap in compiled code (needs tests, as you can't do partial info), and would also have the advantage that fixed width would be highly compressible itself. (Most nodes are going to have a length that fits 1-2 bytes.) An alternative form would be to use the base-128 encoding. (If the MSB is set, then the next byte needs to be added to the current value shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but stores 127B in 1 byte, and 2MiB in 3 bytes. If we only stored 64k entries in a 4 MiB group, the average size can only be 64B, which fits in a single byte length, so 64KiB for this header, or only 1.5% overhead. We also don't have to compute the offset of *all* nodes, just the ones before the one we want, which is the similar to what we have to do to get the actual content out. Partial Hash ============ The size of the index is dominated by the individual entries (the 1M records). Saving 1 byte there saves 1MB overall, which is the same as the group entries and mini index combined. If we can change the index so that it can handle collisions gracefully (have multiple records for a given collision), then we can shrink the number of bytes we need overall. Also, if we aren't going to put the full 20-bytes into the index, then some form of graceful handling of collisions is recommended anyway. 
The current structure does this just fine, in that the mini-index dereferences you to a "list" of records that start with that prefix. It is assumed that those would be sorted, but we could easily have multiple records. To resolve the exact record, you can read both records, and compute the sha1 to decide between them. This has performance implications, as you are now decoding 2x the records to get at one. The chance of ``n`` texts colliding with a hash space of ``H`` is generally given as:: 1 - e ^(-n^2 / 2 H) Or if you use ``H = 2^h``, where ``h`` is the number of bits:: 1 - e ^(-n^2 / 2^(h+1)) For 1M keys and 4-bytes (32-bit), the chance of collision is for all intents and purposes 100%. Rewriting the equation to give the number of bits (``h``) needed versus the number of entries (``n``) and the desired collision rate (``epsilon``):: h = log_2(-n^2 / ln(1-epsilon)) - 1 The denominator ``ln(1-epsilon)`` == ``-epsilon`` for small values (even @0.1 == -0.105, and we are assuming we want a much lower chance of collision than 10%). So we have:: h = log_2(n^2/epsilon) - 1 = 2 log_2(n) - log_2(epsilon) - 1 Given that ``epsilon`` will often be very small and ``n`` very large, it can be more convenient to transform it into ``epsilon = 10^-E`` and ``n = 10^N``, which gives us:: h = 2 * log_2(10^N) - log_2(10^-E) - 1 h = log_2(10) (2N + E) - 1 h ~ 3.3 (2N + E) - 1 Or if we use number of bytes ``h = 8H``:: H ~ 0.4 (2N + E) This gives a useful rule of thumb. For every order of magnitude we want to increase the number of keys (at the same chance of collision), we need ~1 byte (0.8), for every two orders of magnitude we want to reduce the chance of collision we need the same extra bytes. So with 8 bytes, you can have 20 orders of magnitude to work with, 10^10 keys, with guaranteed collision, or 10 keys with 10^-20 chance of collision. Putting this in a different form, we could make ``epsilon == 1/n``. 
This gives us an interesting simplified form:: h = log_2(n^3) - 1 = 3 log_2(n) - 1 writing ``n`` as ``10^N``, and ``h = 8H``:: h = 3 N log_2(10) - 1 =~ 10 N - 1 H ~ 1.25 N So to have a one in a million chance of collision using 1 million keys, you need ~59 bits, or slightly more than 7 bytes. For 10 million keys and a one in 10 million chance of any of them colliding, you can use 9 (8.6) bytes. With 10 bytes, we have a one in a 100M chance of getting a collision in 100M keys (substituting back, the original equation says the chance of collision is 4e-9 for 100M keys when using 10 bytes.) Given that the only cost for a collision is reading a second page and ensuring the sha hash actually matches, we could actually use a fairly "high" collision rate. A chance of 1 in 1000 that you will collide in an index with 1M keys is certainly acceptable. (Note that this isn't 1 in 1000 of those keys will be a collision, but 1 in 1000 that you will have a *single* collision.) Using a collision chance of 10^-3, and number of keys 10^6, means we need (12+3)*0.4 = 6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8 aka 7. We get that extra byte from the ``mini-index``. In an index with a lot of keys, you want a bigger fan-out up front anyway, which gives you more bytes consumed and extends your effective key width. Also taking one more look at ``H ~ 0.4 (2N + E)``, you can rearrange and consider that for every order of magnitude more keys you insert, your chance for collision goes up by 2 orders of magnitude. But for 100M keys, 8 bytes gives you a 1 in 10,000 chance of collision, and that is obtained with a 16-bit fan-out (64k-way), but for 100M keys, we would likely want at least 20-bit fan out. 
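The two formulas from this section are easy to evaluate directly, which helps when checking the rules of thumb. The sketch below uses hypothetical function names; it relies on ``math.expm1`` and ``math.log1p`` so that very small probabilities are not lost to floating point rounding.

```python
import math


def collision_probability(n, bits):
    """Approximate chance of at least one collision among n keys.

    Implements 1 - e^(-n^2 / 2^(h+1)) with h = bits.
    """
    return -math.expm1(-(n * n) / 2.0 ** (bits + 1))


def bits_needed(n, epsilon):
    """Hash width from h = log_2(-n^2 / ln(1 - epsilon)) - 1."""
    return math.log2(-(n * n) / math.log1p(-epsilon)) - 1


# 1M keys with a 4-byte prefix: a collision is all but certain.
print(collision_probability(10 ** 6, 32))
# 100M keys with a 10-byte prefix.
print(collision_probability(10 ** 8, 80))
# Bits for a one in a million chance of any collision among 1M keys.
print(bits_needed(10 ** 6, 1e-6))
```

Dividing the result of ``bits_needed`` by 8 gives the prefix width in bytes, which can then be compared against ``H ~ 0.4 (2N + E)``.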
You can also see this from the original equation with a bit of rearranging (here writing ``n = 2^N``):: epsilon = 1 - e^(-n^2 / 2^(h+1)) epsilon = 1 - e^(-(2^N)^2 / (2^(h+1))) = 1 - e^(-(2^(2N))(2^-(h+1))) = 1 - e^(-(2^(2N - h - 1))) Such that you want ``2N - h`` to be a very negative integer, such that ``2^-X`` is thus very close to zero, and ``1-e^0 = 0``. But you can see that if you want to double the number of source texts, you need to quadruple the size of the hash space, which means adding 2 bits. Scaling Sizes ============= Scaling up ---------- We have said we want to be able to scale to a tree with 1M files and 1M commits. With a 255-way fan out for chk pages, you need 2 internal nodes, and a leaf node with 16 items. (You maintain 2 internal nodes up until 16.5M nodes, when you get another internal node, and your leaf nodes shrink down to 1 again.) If we assume every commit averages 10 changes (large, but possible, especially with large merges), then you get 1 root + 10*(1 internal + 1 leaf node) per commit or 21 nodes per commit. At 1M revisions, that is 21M chk nodes. So to support the 1Mx1M project, we really need to consider having up to 100M chk nodes. Even if you went up to 16M tree nodes, that only bumps us up to 31M chk nodes. Though it also scales by number of changes, so if you had a huge churn, and had 100 changes per commit and a 16M node tree, you would have 301M chk nodes. Note that 8 bytes (64-bits) in the prefix still only gives us a 0.27% chance of collision (1 in 370). Or if you had 370 projects of that size, with all different content, *one* of them would have a collision in the index. We also should consider that you have the ``(parent_id,basename) => file_id`` map that takes up its own set of chk pages, but testing seems to indicate that it is only about 1/10th that of the ``id_to_entry`` map. (rename,add,delete are much less common than content changes.) 
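The node counts used in this section follow one simple pattern: each commit rewrites the root once, and every change rewrites the remaining nodes on its path down to a leaf. A small sketch of that estimate follows; the function name and the ``internal_levels`` parameter are made up for illustration, and the model ignores changes that share intermediate pages, so it is an upper bound.

```python
def chk_nodes_per_commit(changes, internal_levels):
    """Estimate the new chk nodes written by one commit.

    internal_levels counts the internal nodes on a root-to-leaf path,
    root included. The root is rewritten once per commit; each change
    rewrites the other internal nodes on its path plus one leaf.
    """
    below_root = internal_levels - 1
    return 1 + changes * (below_root + 1)


revisions = 10 ** 6
for changes, levels in [(10, 2), (10, 3), (100, 3)]:
    per_commit = chk_nodes_per_commit(changes, levels)
    print(changes, levels, per_commit, per_commit * revisions)
```

With 2 internal levels and 10 changes this gives the 21 nodes per commit quoted above, and 3 internal levels give the 31 and 301 figures.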
As a point of reference, one of the largest projects today OOo, has only 170k revisions, and something less than 100k files (and probably 4-5 changes per commit, but their history has very few merges, being a conversion from CVS). At 100k files, they are probably just starting to hit 2-internal nodes, so they would end up with 10 pages per commit (as a fair-but-high estimate), and at 170k revs, that would be 1.7M chk nodes. Scaling down ------------ While it is nice to scale to a 16M files tree with 1M files (100M total changes), it is also important to scale efficiently to more *real world* scenarios. Most projects will fall into the 255-64k file range, which is where you have one internal node and 255 leaf nodes (1-2 chk nodes per commit). And a modest number of changes (10 is generally a high figure). At 50k revisions, that would give you 50*2*10=500k chk nodes. (Note that all of python has 303k chk nodes, all of launchpad has 350k, mysql-5.1 in gc255 rather than gc255big had 650k chk nodes, [depth=3].) So for these trees, scaling to 1M nodes is more than sufficient, and allows us to use a 6-byte prefix per record. At a minimum, group records could use a 4-byte start and 3-byte length, but honestly, they are a tiny fraction of the overall index size, and it isn't really worth the implementation cost of being flexible here. We can keep a field in the header for the group record layout (8, 4) and for now just assert that this size is fixed. Other discussion ================ group encoding -------------- In the above scheme we store the group locations as an 8-byte start, and 4-byte length. We could theoretically just store a 4-byte length, and then you have to read all of the groups and add them up to determine the actual start position. The trade off is a direct jump-to-location versus storing 3x the data. Given when you have 64k groups you will need only .75MiB to store it, versus the 120MB for the actual entries, this seems to be no real overhead. 
Especially when you consider that 10M chk nodes should fit in only 250 groups, so total data is actually only 3KiB. Then again, if it was only 1KiB it is obvious that you would read the whole thing in one pass. But again, see the pathological "conversion creating 1 group per chk page" issue. Also, we might want to support more than 64k groups in a given index when we get to the point of storing file content in a CHK index. A lot of the analysis about the number of groups is based on the 100 byte compression of CHK nodes, which would not be true with file-content. We should compress well, but I don't expect us to compress *that* well. Launchpad shows that the average size of a content record is about 500-600 bytes (after you filter out the ~140k that are NULL content records). At that size, you expect to get approx 7k records per group, down from 40k. Going further, though, you also want to split groups earlier, since you end up with better compression. So with 100,000 unique file texts, you end up with ~100 groups. With 1M revisions @ 10 changes each, you have 10M file texts, and would end up at 10,485 groups. So 64k groups still seems like more than enough head room. You need to fit only 100 entries per group, to get down to where you are getting into trouble (and have 10M file texts.) Something to keep an eye on, but unlikely to be something that is strictly a problem. Still reasonable to have a record in the header indicating that index entries use a 2-byte group entry pointer, and allow it to scale to 3 (we may also find a win scaling it down to 1 in the common cases of <250 groups). Note that if you have the full 4MB groups, it takes 256 GB of compressed content to fill 64k records. And our groups are currently scaled so that we require at least 1-2MB before they can be considered 'full'. variable length index entries ----------------------------- The above had us store 8-bytes of sha hash, 2 bytes of group number, and 2 bytes for record-in-group. 
However, since we have the variable-pointer mini-index, we could consider having those values be 'variable length'. So when you read the bytes between the previous-and-next record, you have a parser that can handle variable width. The main problem is that to encode start/stop of record takes some bytes, and at 12-bytes for a record, you don't have a lot of space to waste for a "end-of-entry" indicator. The easiest would be to store things in base-128 (high bit indicates the next byte also should be included). storing uncompressed offset + length ------------------------------------ To get the smallest index possible, we store only a 2-byte 'record indicator' inside the index, and then assume that it can be decoded once we've read the actual group. This is certainly possible, but it represents yet another layer of indirection before you can actually get content. If we went with variable-length index entries, we could probably get most of the benefit with a variable-width start-of-entry value. The length-of-content is already being stored as a base128 integer starting at the second byte of the uncompressed data (the first being the record type, fulltext/delta). It complicates some of our other processing, since we would then only know how much to decompress to get the start of the record. Another intriguing possibility would be to store the *end* of the record in the index, and then in the data stream store the length and type information at the *end* of the record, rather than at the beginning (or possibly at both ends). Storing it at the end is a bit unintuitive when you think about reading in the data as a stream, and figuring out information (you have to read to the end, then seek back) But a given GC block does store the length-of-uncompressed-content, which means we can trivially decompress, jump to the end, and then walk-backwards for everything else. Given that every byte in an index entry costs 10MiB in a 10M index, it is worth considering. 
At 4MiB for a block, base 128 takes 4 bytes to encode the last 50% of records (those beyond 2MiB), 3 bytes for everything from 16KiB => 2MiB. So the expected size is for all intents and purposes, 3.5 bytes. (Just due to an unfortunate effect of where the boundary is that you need more bytes.) If we capped the data at 2MB, the expected drops to just under 3 bytes. Note that a flat 3bytes could decode up to 16MiB, which would be much better for our purpose, but wouldn't let us write groups that had a record after 16MiB, which doesn't work for the ISO case. Though it works *absolutely* fine for the CHK inventory cases (what we have today). null content ------------ At the moment, we have a lot of records in our per-file graph that refers to empty content. We get one for every symlink and directory, for every time that they change. This isn't specifically relevant for CHK pages, but for efficiency we could certainly consider setting "group = 0 entry = 0" to mean that this is actually a no-content entry. It means the group block itself doesn't have to hold a record for it, etc. Alternatively we could use "group=FFFF entry = FFFF" to mean the same thing. ``VF.keys()`` ------------- At the moment, some apis expect that you can list the references by reading all of the index. We would like to get away from this anyway, as it doesn't scale particularly well. However, with this format, we no longer store the exact value for the content. The content is self describing, and we *would* be storing enough to uniquely decide which node to read. Though that is actually contained in just 4-bytes (2-byte group, 2-byte group entry). We use ``VF.keys()`` during 'pack' and 'autopack' to avoid asking for content we don't have, and to put a counter on the progress bar. For the latter, we can just use ``index.key_count()`` for the former, we could just properly handle ``AbsentContentFactory``. 
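As a rough illustration of the fixed-width entry discussed above (8 bytes of sha hash, a 2-byte group number and a 2-byte record-in-group), the following sketch packs and unpacks one entry with Python's ``struct`` module. The big-endian byte order, the helper names and the choice of ``FFFF``/``FFFF`` as the no-content marker are assumptions made for this example; the text leaves all three open.

```python
import hashlib
import struct

# Assumed layout: 8-byte hash prefix, 2-byte group, 2-byte record-in-group.
# Big-endian is an assumption; the design text does not fix a byte order.
ENTRY = struct.Struct(">8sHH")

# One of the two sentinels suggested for "no content" entries.
NULL_CONTENT = (0xFFFF, 0xFFFF)


def pack_entry(sha1_digest, group, record):
    """Pack one index entry, keeping only the first 8 bytes of the hash."""
    return ENTRY.pack(sha1_digest[:8], group, record)


def unpack_entry(data):
    """Return (hash_prefix, group, record) for a 12-byte entry."""
    return ENTRY.unpack(data)


def is_null_content(group, record):
    """True when the pointer is the assumed no-content sentinel."""
    return (group, record) == NULL_CONTENT


digest = hashlib.sha1(b"example chk page").digest()
entry = pack_entry(digest, 3, 517)
prefix, group, record = unpack_entry(entry)
```

The 4-byte (group, record) pointer is all a reader needs to locate content, which is why listing exact keys from the index alone is no longer possible with this layout.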
More than 64k groups -------------------- Doing a streaming conversion all at once is still something to consider. As it would default to creating all chk pages in separate groups (300-400k easily). However, just making the number of group block entries variable, and allowing the pointer in each entry to be variable should suffice. At 3 bytes for the group pointer, we can refer to 16.7M groups. It does add complexity, but it is likely necessary to allow for arbitrary cases. .. vim: ft=rst tw=78 ai bzrformats_3.4.0.orig/doc/index.txt0000644000000000000000000000312015162203117014265 0ustar00Bazaar/Breezy Format Specifications =================================== This directory contains documentation about the various data formats used by Bazaar/Breezy for storing and transmitting version control data. Contents -------- .. toctree:: :maxdepth: 1 bundle-format4 bundles container-format dirstate btree_index_prefetch improved_chk_index indices inventory packrepo repository repository-stream groupcompress-design Overview -------- These documents describe the on-disk and network formats used by Bazaar/Breezy: * **Bundle Formats**: Formats for transmitting revisions as bundles - :doc:`bundles` - Bundle facility design - :doc:`bundle-format4` - Bundle format 4 and Merge Directive format 2 * **Repository Formats**: Storage formats for revision data - :doc:`repository` - Repository services and pack-based repositories - :doc:`packrepo` - KnitPack repository format - :doc:`groupcompress-design` - Groupcompress format design - :doc:`repository-stream` - Repository streaming format * **Working Tree Formats**: Formats for working tree metadata - :doc:`dirstate` - Dirstate format for working trees - :doc:`inventory` - Inventory formats and serialization * **Index Formats**: Indexing structures for efficient data access - :doc:`indices` - Indexing facilities - :doc:`btree_index_prefetch` - BTree index format and prefetching - :doc:`improved_chk_index` - CHK optimized index format * 
**Container Format**: General-purpose container for streaming data

  - :doc:`container-format` - Container format specification

bzrformats_3.4.0.orig/doc/indices.txt0000644000000000000000000000624715162203117014611 0ustar00
=======
Indices
=======

Status
======

:Date: 2007-07-14

This document describes the indexing facilities within breezy.

.. contents::

Motivation
==========

To provide a clean concept of index that can be reused by different components within the codebase rather than being rewritten every time by different components.

Terminology
===========

An **index** is a dictionary mapping opaque keys to opaque values. Different index types may allow some of the value data to be interpreted by the index. For example the ``GraphIndex`` index stores a graph between keys as part of the index.

Overview
========

Breezy is moving to a write-once model for repository storage in order to achieve lock-free repositories eventually. In order to support this, we are making our new index classes **immutable**. That is, one creates a new index in a single operation, and after that it is read only. To combine two indices a ``Combined*`` index may be used, or an **index merge** may be performed by reading the entire value of two (or more) indices and writing them into a new index.

General Index API
=================

We may end up with multiple different Index types (e.g. GraphIndex, Index, WhackyIndex). Even though these may require different method signatures to operate, we would strive to keep the signatures and return values as similar as possible. e.g.::

    GraphIndexBuilder - add_node(key, value, references)
    IndexBuilder - add_node(key, value)
    WhackyIndexBuilder - add_node(key, value, whackiness)

as opposed to something quite different like::

    node = IncrementalBuilder.get_node()
    node.key = 'foo'
    node.value = 'bar'

Services
--------

An initial implementation of indexing can probably get away with a small number of primitives.
Assuming we have write once index files: Build index ~~~~~~~~~~~ This should be done by creating an ``IndexBuilder`` and then calling ``insert(key, value)`` many times. (Indices that support sorting, topological sorting etc, will want specialised insert methods). When the keys have all been added, a ``finish`` method should be called, which will return a file stream to read the index data from. Retrieve entries from the index ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This should allow random access to the index using readv, so we probably want to open the index on a ``Transport``, then use ``iter_entries(keys)``, which can return an iterator that yields ``(key, value)`` pairs in whatever order makes sense for the index. Merging of indices ~~~~~~~~~~~~~~~~~~ Merging of N indices requires a concordance of the keys of the index. So we should offer a ``iter_all_entries`` call that has the same return type as the ``iter_entries`` call. Index implementations ===================== GraphIndex ---------- ``GraphIndex`` supports graph based lookups. While currently unoptimised for reading, the index is quite space efficient at storing the revision graph index for Breezy. The ``GraphIndexBuilder`` may be used to create one of these indices by calling ``add_node`` until all nodes are added, then ``finish`` to obtain a file stream containing the index data. Multiple indices may be queried using the ``CombinedGraphIndex`` class. .. vim: ft=rst tw=74 ai bzrformats_3.4.0.orig/doc/inventory.txt0000644000000000000000000006246615162203117015235 0ustar00=========== Inventories =========== .. contents:: Overview ======== Inventories provide an abstraction for talking about the shape of a tree. Generally only tree object implementors should be concerned about entire inventory objects and their implementation. Other common exceptions are full-tree operations such as 'checkout', 'export' and 'import'. 
In memory inventories ===================== In memory inventories are often used in diff and status operations between trees. We are working to reduce the number of times this occurs with 'full tree' inventory objects, and instead use more custom tailored data structures that allow operations on only a small amount of data regardless of the size of the tree. Serialization ============= There are several variants of serialised tree shape in use by Breezy. To date these have been mostly XML-based, though plugins have offered non-XML versions. dirstate -------- The dirstate file in a working tree includes many different tree shapes - one for the working tree and one for each parent tree, interleaved to allow efficient diff and status operations. XML --- All the XML serialized forms write to and read from a single byte string, whose hash is then the inventory validator for the commit object. Serialization scaling and future designs ======================================== Overall efficiency and scaling is constrained by the bottom level structure that an inventory is stored as. We have a number of goals we want to achieve: 1. Allow commit to write less than the full tree's data in to the repository in the general case. 2. Allow the data that is written to be calculated without examining every versioned path in the tree. 3. Generate the exact same representation for a given inventory regardless of the amount of history available. 4. Allow in memory deltas to be generated directly from the serialised form without upcasting to a full in-memory representation or examining every path in the tree. Ideally the work performed will be proportional to the amount of changes between the trees being compared. 5. Allow fetch to determine the file texts that need to be pulled to ensure that the entire tree can be reconstructed without having to probe every path in the tree. 6. Allow Breezy to map paths to file ids without reading the entire serialised form. 
This is something that is used by commands such as merge PATH and diff -r X PATH. 7. Let Breezy map file ids to paths without reading the entire serialised form. This is used by commands that are presenting output to the user such as loggerhead, brz-search, log FILENAME. 8. We want a strong validator for inventories which is cheap to generate. Specifically we should be able to create the generator for a new commit without processing all the data of the basis commit. 9. Testaments generation is currently size(tree), we would like to create a new testament standard which requires less work so that signed commits are not significantly slower than regular commits. We have current performance and memory bugs in log -v, merge, commit, diff -r, loggerhead and status -r which can be addressed by an inventory system meeting these goals. Current situation ----------------- The XML-based implementation we use today layers the inventory as a bytestring which is stored under a single key; the bytestring is then compressed as a delta against the bytestring of its left hand parent by the knit code. Gap analysis: 1. Succeeds 2. Fails - generating a new XML representation needs full tree data. 3. Succeeds - the inventory layer accesses the bytestring, which is deterministic 4. Fails - we have to reconstruct both inventories as trees and then delta the resulting in memory objects. 5. Partial success - the revision field in the inventory can be scanned for in both text-delta and full-bytestring form; other revision values than those revisions which are being pulled are by definition absent. 6. Partially succeeds - with appropriate logic a path<->id map can be generated just-in-time, but it is complex and still requires reconstructing the entire byte-string. 7. As for 6. 8. Fails - we have to hash the entire tree in serialised form to generate validators. 9. Fails. Long term work -------------- Some things are likely harder to fix incrementally than others. 
In particular, goal 3 (constant canonical form) is arguably only achieved if we remove all derived data such as the last-modified revision from the inventory itself. That said, the last-modified appears to be in a higher level than raw serialization. So in the medium term we will not alter the contents of inventories, only the way that the current contents are mapped to and from disk.

Layering
--------

We desire clear and clean layers. Each layer should be as simple as we can make it to aid in debugging and performance tuning. So where we can choose to either write a complex layer and something simple on top of it, or two layers with neither being as complex - then we should consider the latter choice better in the absence of compelling reasons not to.

Some key layers we have today and can look at using or tweaking are:

* Tree objects - the abstract interface breezy code works in
* VersionedFiles - the optionally delta compressing key->bytes storage interface.
* Inventory - the abstract interface that many tree operations are written in.

These layers are probably sufficient with minor tweaking. We may want to add additional modules/implementations of one or more layers, but that doesn't really require new layers to be exposed.

Design elements to achieve the goals in a future inventory implementation
-------------------------------------------------------------------------

* Split up the logical document into smaller serialised fragments. For instance hash buckets or nodes in a tree of some sort. By serialising in smaller units, we can increase the number of smaller units rather than their size as the tree grows; as long as two similar trees have similar serialised forms, the amount of shared content should be quite high.

* Use fragment identifiers that are independent of revision id, so that serialisation of two related trees generates overlap in the keyspace for fragments without requiring explicit delta logic. Content Hash Keys (e.g.
  ('sha1:ABCDEF0123456789...',) are useful here because of the ability to assign them without reference to history.)

* Store the fragments in our existing VersionedFiles store, adding an index for them. Have the serialised form be uncompressed utf8, so that delta logic in the VersionedFiles layer can be used. We may need to provide some sort of hinting mechanism to get good compression - but the trivially available zlib compression of knits-with-no-deltas is probably a good start.

* Item_keys_introduced_by is innately a history-using function; we can reproduce the text-key finding logic by doing a tree diff between any tree and an older tree - that will limit the amount of data we need to process to something proportional to the difference and the size of each fragment. When checking many versions we can track which fragments we have examined and only look at new unique ones as each version is examined in turn.

* Working tree to arbitrary history revision deltas/comparisons can be scaled up by doing a two-step (fixed at two!) delta combining - delta(tree, basis) and then combine that with delta(basis, arbitrary_revision) using the repository's ability to get a delta cheaply.

* The key primitives we need seem to be:

  * canonical_form(inventory) -> fragments
  * delta(inventory, inventory) -> inventory_delta
  * apply(inventory_delta, canonical_form) -> fragments

* Having very many small fragments is likely to cause a high latency multiplier unless we are careful.

* Possible designs to investigate - a hash bucket approach, radix trees, B+ trees, directory trees (with splits inside a directory?).

Hash bucket based inventories
=============================

Overview
--------

We store two maps - fileid:inventory_entry and path:fileid, in a stable hash trie, stored in densely packed fragments. We pack keys into the map densely up the tree, with a single canonical form for any given tree.
This is more stable than simple fixed size buckets, which prevents corner cases where the tree size varies right on a bucket size border. (Note that such cases are not a fatal flaw - the two forms would both be present in the repository, so only a small amount of data would be written at each transition - but a full tree reprocess would be needed at each tree operation across the boundary, and that's undesirable.)

Goal satisfaction
-----------------

1. Success
2. Success
3. Success
4. Success, though each change will need its parents looked up as well so it will be proportional to the changes + the directories above the changed path.
5. Success - looking at the difference against all parents we can determine new keys without reference to the repository the content will be inserted into.
6. This probably needs a path->id map, allowing a 2-step lookup.
7. If we allocate buckets by hashing the id, then this succeeds, though, as per 4, it will need recursive lookups.
8. Success
9. Fail - data beyond that currently included in testaments is included in the strong validator.

Issues
------

1. Tuning the fragment size needs doing.
2. Testing.
3. Writing code.
4. Separate root node, or inline into revision?
5. Cannot do 'ls' efficiently in the current design.
6. Cannot detect invalid deltas easily.
7. What about LCA merge of inventories?

Canonical form
--------------

There are three fragment types for the canonical form. Each fragment is addressed using a Content Hash Key (CHK) - for instance "sha1:12345678901234567890".

root_node: (Perhaps this should be inlined into the revision object).
HASH_INVENTORY_SIGNATURE
path_map: CHK to root of path to id map
content_map: CHK to root of id to entry map

map_node: INTERNAL_NODE or LEAF_NODE
INTERNAL_NODE:
INTERNAL_NODE_SIGNATURE
hash_prefix: PREFIX
prefix_width: INT
PREFIX CHK TYPE SIZE
PREFIX CHK TYPE SIZE ...

(Where TYPE is I for internal or L for leaf).
leaf_node: LEAF_NODE_SIGNATURE hash_prefix: PREFIX HASH\x00KEY\x00 VALUE For path maps, VALUE is:: fileid For content maps, VALUE:: fileid basename kind last-changed kind-specific-details The path and content maps are populated simply by serialising every inventory entry and inserting them into both the path map and the content map. The maps start with just a single leaf node with an empty prefix. Apply ----- Given an inventory delta - a list of (old_path, new_path, InventoryEntry) items, with a None in new_path indicating a delete operation, and recursive deletes not being permitted - all entries to be deleted must be explicitly listed, we can transform a current inventory directly. We can't trivially detect an invalid delta though. To perform an application, naively we can just update both maps. For the path map we would remove all entries where the paths in the delta do not match, then insert those with a new_path again. For the content map we would just remove all the fileids in the delta, then insert those with a new_path that is not None. Delta ----- To generate a delta between two inventories, we first generate a list of altered fileids, and then recursively look up their parents to generate their old and new file paths. To generate the list of altered file ids, we do an entry by entry comparison of the full contents of every leaf node that the two inventories do not have in common. To do this, we start at the root node, and follow every CHK pointer that is only in one tree. We can then bring in all the values from the leaf nodes and do a set difference to get the altered ones, which we would then parse. Radix tree based inventories ============================ Overview -------- We store two maps - fileid:path and path:inventory_entry. The fileid:path map is a hash trie (as file ids have no useful locality of reference). The path:inventory_entry map is stored as a regular trie. 
As for hash tries we define a single canonical representation for regular tries similar to that defined above for hash tries.

Goal satisfaction
-----------------

1. Success
2. Success
3. Success
4. Success
5. Success - looking at the difference against all parents we can determine new keys without reference to the repository the content will be inserted into.
6. Success
7. Success
8. Success
9. Fail - data beyond that currently included in testaments is included in the strong validator.

Issues
------

1. Tuning the fragment size needs doing.
2. Testing.
3. Writing code.
4. Separate root node, or inline into revision?
5. What about LCA merge of inventories?

Canonical form
--------------

There are five fragment types for the canonical form: The root node, hash trie internal and leaf nodes as previous. Then we have two more, the internal and leaf node for the radix tree.

radix_node: INTERNAL_NODE or LEAF_NODE

INTERNAL_NODE:
INTERNAL_NODE_SIGNATURE
prefix: PREFIX
suffix CHK TYPE SIZE
suffix CHK TYPE SIZE ...

(Where TYPE is I for internal or L for leaf).

LEAF_NODE:
LEAF_NODE_SIGNATURE
prefix: PREFIX
suffix\x00VALUE

For the content map we use the same value as for hashtrie inventories.

Node splitting and joining in the radix tree are managed in the same fashion as for the internal nodes of the hashtries.

Apply
-----

Apply is implemented as for hashtries - we just remove and reinsert the fileid:paths map entries, and likewise for the path:entry map. We can however cheaply detect invalid deltas where a delete fails to include its children.

Delta
-----

Delta generation is very similar to that with hash tries, except we get the path of nodes as part of the lookup process.

Hash Trie details
=================

The canonical form for a hash trie is a tree of internal nodes leading down to leaf nodes, with no node exceeding some threshold size, and every node containing as much content as it can, but no leaf node containing less than its lower size threshold.
(In the event that an imbalance in the hash function causes a tree where an internal node is needed, but any prefix generates a child with less than the lower threshold, the smallest prefix should be taken). An internal node holds some number of key prefixes, all with the same bit-width. A leaf node holds the actual values. As trees do not spring fully-formed, the canonical form is defined iteratively - by taking every item in a tree and inserting it into a new tree in order you can determine what canonical form would look like. As that is an expensive operation, it should only be done rarely.

Updates to a tree that is in canonical form can be done preserving canonical form if we can prove that our rules for insertion are order-independent, and that our rules for deletion generate the same tree as if we never inserted those nodes.

Our hash tries are balanced vertically but not horizontally. That is, one leg of a tree can be arbitrarily deeper than adjacent legs. We require that each node along a path within the tree be densely packed, with the densest nodes near the top of the tree, and the least dense at the bottom. Except where the tree cannot support it, no node is smaller than a minimum_size, and none larger than maximum_size. The minimum size constraint is only applied when there are enough entries under a prefix to meet that minimum. The maximum size constraint is always applied except when a node with a single entry is larger than the maximum size. Loosely, the maximum size constraint wins over the minimum size constraint, and if the minimum size constraint is to be ignored, a deeper prefix can be chosen to pack the containing node more densely, as long as no additional minimum size checks on child nodes are violated.

Insertion
---------

#. Hash the entry, and insert the entry in the leaf node with a matching prefix, creating that node and linking it from the internal node containing that prefix if there is no appropriate leaf node.

#.
Starting at the highest node altered, for all altered nodes, check if it has transitioned across either size boundary - 0 < min_size < max_size. If it has not, proceed to update the CHK pointers. #. If it increased above min_size, check the node above to see if it can be more densely packed. To be below the min_size the node's parent must have hit the max size constraint and been forced to split even though this child did not have enough content to support a min_size node - so the prefix chosen in the parent may be shorter than desirable and we may now be able to more densely pack the parent by splitting the child nodes more. So if the parent node can support a deeper prefix without hitting max_size, and the count of under min_size nodes cannot be reduced, the parent should be given a deeper prefix. #. If it increased above max_size, shrink the prefix width used to split out new nodes until the node is below max_size (unless the prefix width is already 1 - the minimum). To shrink the prefix of an internal node, create new internal nodes for each new prefix, and populate them with the content of the nodes which were formerly linked. (This will normally bubble down due to keeping densely packed nodes). To shrink the prefix of a leaf node, create an internal node with the same prefix, then choose a width for the internal node such that the contents of the leaf all fit into new leaves obeying the min_size and max_size rules. The largest prefix possible should be chosen, to obey the higher-nodes-are-denser rule. That rule also gives room in leaf nodes for growth without affecting the parent node packing. #. Update the CHK pointers - serialise every altered node to generate a CHK, and update the CHK placeholder in the nodes parent; then reserialise the parent. CHK pointer propagation can be done lazily when many updates are expected. Multiple versions of nodes for the same PREFIX and internal prefix width should compress well for the same tree. 
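The last insertion step (reserialise each altered node, then its parent) can be sketched as follows. This is a minimal illustration only: the node wire format, the single-character prefixes and the helper names are hypothetical, and the min_size/max_size splitting rules are left out entirely. It shows just that an altered leaf produces a new CHK, which in turn changes the parent's bytes and CHK, while unchanged leaves keep their keys.

```python
import hashlib


def chk(data):
    """Content hash key for a serialised node."""
    return "sha1:" + hashlib.sha1(data).hexdigest()


def serialise_leaf(prefix, items):
    # Hypothetical wire form: sorted "key\x00value" lines, so the bytes
    # (and therefore the CHK) do not depend on insertion order.
    lines = ["leaf", "hash_prefix: " + prefix]
    lines.extend(key + "\x00" + items[key] for key in sorted(items))
    return ("\n".join(lines) + "\n").encode("utf-8")


def serialise_internal(prefix, children):
    # One "PREFIX CHK" line per child, again in sorted order.
    lines = ["internal", "hash_prefix: " + prefix]
    lines.extend(p + " " + children[p] for p in sorted(children))
    return ("\n".join(lines) + "\n").encode("utf-8")


def store_tree(leaves, store):
    """Serialise each leaf, then the parent pointing at them; return root CHK."""
    children = {}
    for prefix, items in leaves.items():
        data = serialise_leaf(prefix, items)
        children[prefix] = chk(data)
        store[children[prefix]] = data
    root = serialise_internal("", children)
    store[chk(root)] = root
    return chk(root)


store = {}
leaves = {"0": {"file-a": "a.txt"}, "1": {"file-b": "b.txt"}}
old_root = store_tree(leaves, store)
leaves["1"]["file-c"] = "c.txt"  # alter one leaf only
new_root = store_tree(leaves, store)
```

After the second call the store holds five nodes: the untouched leaf is shared between both versions, and only the altered leaf and the root were written again.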
Inventory deltas
================

A serialized inventory delta is the on-disk form of the in-memory inventory delta. To serialize an inventory delta, one takes an existing inventory delta, the revision id of the revision it was created against, and the revision id of the inventory which should result by applying the delta to the parent. We then serialize every item in the delta in a simple format::

    'format: bzr inventory delta v1 (1.14)' NL
    'parent:' SP BASIS_INVENTORY NL
    'version:' SP NULL_OR_REVISION NL
    'versioned_root:' SP BOOL NL
    'tree_references:' SP BOOL NL
    DELTA_LINES

    DELTA_LINES ::= (DELTA_LINE NL)*
    DELTA_LINE ::= OLDPATH NULL NEWPATH NULL file-id NULL PARENT_ID NULL LAST_MODIFIED NULL CONTENT
    SP ::= ' '
    BOOL ::= 'true' | 'false'
    NULL ::= \x00
    OLDPATH ::= NONE | PATH
    NEWPATH ::= NONE | PATH
    NONE ::= 'None'
    PATH ::= path
    PARENT_ID ::= FILE_ID | ''
    CONTENT ::= DELETED_CONTENT | FILE_CONTENT | DIR_CONTENT | TREE_CONTENT | LINK_CONTENT
    DELETED_CONTENT ::= 'deleted'
    FILE_CONTENT ::= 'file' NULL text_size NULL EXEC NULL text_sha1
    DIR_CONTENT ::= 'dir'
    TREE_CONTENT ::= 'tree' NULL tree-revision
    LINK_CONTENT ::= 'link' NULL link-target
    BASIS_INVENTORY ::= NULL_OR_REVISION
    LAST_MODIFIED ::= NULL_OR_REVISION
    NULL_OR_REVISION ::= 'null:' | REVISION
    REVISION ::= revision-id-in-utf8-no-whitespace
    EXEC ::= '' | 'Y'

DELTA_LINES is lexicographically sorted.

Some explanation is in order. When NEWPATH is 'None' a delete has been recorded, and because this inventory delta is not attempting to be a reversible delta, the only other valid fields are OLDPATH and 'file-id'. PARENT_ID is '' when a delete has been recorded or when recording a new root entry.

Delta consistency
=================

Inventory deltas and more broadly changes between trees are a significant part of Breezy's core operations: they are key components in status, diff, commit, and merge (although merge uses tree transform, deltas contain the changes that are applied to the transform).
Our ability to perform a given operation depends on us creating consistent deltas between trees. Inconsistent deltas lead to errors and bugs, or even just unexpected conflicts.

An inventory delta is a transform to change an inventory A into another inventory B (in patch terms it's a perfect patch). Sometimes, for instance in a regular commit, inventory B is known at the time we create the delta. Other times, B is not known because the user is requesting that some parts of the second inventory they have are masked out from consideration. When this happens we create a delta that when applied to A creates a B we haven't seen in total before. In this situation we need to ensure that B will be internally consistent. Deltas are unidirectional: a delta(A, B) creates B from A, but cannot be used to create A from B.

Deltas are expressed as a list of (oldpath, newpath, fileid, entry) tuples. The fileid, entry elements are normative; the old and new paths are strong hints but not currently guaranteed to be accurate. (This is a shame and something we should tighten up). Deltas are required to list all removals explicitly - removing the parent of an entry doesn't remove the entry.

Applying a delta to an inventory consists of:

- removing all fileids for which entry is None
- adding or replacing all other fileids
- detecting consistency errors

An interesting aspect of delta inconsistencies is when we notice them:

- Silent errors which our application logic misses
- Visible errors we catch during application, so bad data isn't stored in the system.

The minimum safe level for our application logic would be to catch all errors during application. Making generation never generate inconsistent deltas is a separate but necessary condition for robust code.

An inconsistent delta is one which:

- after application to an inventory the inventory is in an impossible state.
- has the same fileid, or oldpath(not-None), or newpath(not-None) multiple times.
- has a fileid field different to the entry.fileid in the same item in the delta.
- has an entry that is in an impossible state (e.g. a directory with a text size)

Forms of inventory inconsistency deltas can carry/cause:

- An entry newly introduced to a path without also removing or relocating any existing entry at that path. (Duplicate paths)
- An entry whose parent id isn't present in the tree. (Missing parent).
- Having oldpath or newpath not be the actual original path or resulting path. (Wrong path)
- An entry whose parent is not a directory. (Under non-directory).
- An entry that is internally inconsistent.
- An entry that is already present in the tree (Duplicate id)

Known causes of inconsistency:

- A 'new' entry which the inventory already has - when this is a directory even arbitrary file ids under the 'new' entry are more likely to collide on paths.
- Removing a directory without recursively removing its children - causes Missing parent.
- Recording a change to an entry without including all changed entries found following its parents up to and including the root - can cause duplicate paths, missing parents, wrong path, under non-directory.

Avoiding inconsistent deltas
----------------------------

The simplest thing is to never create partial deltas, as it is trivial to be consistent when all data is examined every time. However users sometimes want to specify a subset of the changes in their tree when they do an operation which needs to create a delta - such as commit.

We have a choice about handling user requests that can generate inconsistent deltas. We can alter or interpret the request in such a way that the delta will be consistent, but perhaps larger than the user had intended. Or we can identify problematic situations and abort, specifying to the user why we have aborted and likely things they can do to make their request generate a consistent delta.
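Whichever choice is made, the conditions in the "inconsistent delta" list that can be checked without looking at the target inventory are cheap to test up front. The sketch below assumes the (oldpath, newpath, fileid, entry) tuple form described earlier; the ``Entry`` namedtuple and the ``delta_problems`` helper are hypothetical stand-ins for illustration, not Breezy's own classes, and the checks that need the inventory itself (missing parents, duplicate paths against existing entries) are not covered.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for an inventory entry, just enough for the checks.
Entry = namedtuple("Entry", "file_id kind text_size")


def delta_problems(delta):
    """Return a list of inconsistencies visible in the delta alone."""
    problems = []
    # The same oldpath, newpath or fileid must not appear more than once.
    for label, index in (("oldpath", 0), ("newpath", 1), ("fileid", 2)):
        counts = Counter(
            item[index] for item in delta if item[index] is not None)
        problems.extend(
            "duplicate %s: %r" % (label, value)
            for value, count in counts.items() if count > 1)
    for oldpath, newpath, file_id, entry in delta:
        if entry is None:
            continue  # a removal carries no entry to inspect
        if entry.file_id != file_id:
            problems.append("fileid mismatch: %r" % (file_id,))
        if entry.kind == "directory" and entry.text_size is not None:
            problems.append("directory with a text size: %r" % (file_id,))
    return problems


good = [
    (None, "doc", "doc-id", Entry("doc-id", "directory", None)),
    ("old.txt", None, "old-id", None),
]
bad = [
    (None, "a", "a-id", Entry("b-id", "file", 3)),
    (None, "a", "c-id", Entry("c-id", "directory", 10)),
]
```

Here ``delta_problems(good)`` is empty, while ``delta_problems(bad)`` reports the shared new path, the mismatched file id and the directory that carries a text size.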
Currently we attempt to expand/interpret the request so that the user is
not required to understand all the internal constraints of the system: if
they request 'foo/bar' we automatically include foo. This works but can
surprise the user sometimes when things they didn't explicitly request are
committed.

Different trees can use different algorithms to expand the request as long
as they produce consistent deltas. As part of getting a consistent UI we
require that all trees expand the paths requested downwards. Beyond that as
long as the delta is consistent it is up to the tree.

Given two trees, source and target, and a set of selected file ids to check
for changes and, if changed, to include in a delta between them, we have to
expand that set by the following rules, to get consistent deltas. The test
for consistency is that if the resulting delta is applied to source, to
create a third tree 'output', and the paths in the delta match the paths in
source and output, only one file id is at each path in output, and no file
ids are missing parents, then the delta is consistent.

Firstly, the parent ids to the root for all of the file ids that have
actually changed must be considered. Unless they are all examined the paths
in the delta may be wrong.

Secondly, when an item included in the delta has a new path which is the
same as a path in source, the fileid of that path in source must be
included. Failing to do this leads to multiple ids trying to share a path
in output.

Thirdly, when an item changes its kind from 'directory' to anything else in
the delta, all of the direct children of the directory in source must be
included.

bzrformats_3.4.0.orig/doc/packrepo.txt0000644000000000000000000002631015162203117014770 0ustar00

==========================
KnitPack repository format
==========================

.. contents::

Using KnitPack repositories
===========================

Motivation
----------

KnitPack is a new repository format for Breezy, which is expected to be
faster both locally and over the network, is usually more compact, and
will work with more FTP servers.

Our benchmarking results to date have been very promising. We fully expect
to make a pack-based format the default in the near future. We would
therefore like as many people as possible using KnitPack repositories,
benchmarking the results and telling us where improvements are still
needed.

Preparation
-----------

A small percentage of existing repositories may have some inconsistent
data within them. It is a good idea to check the integrity of your
repositories before migrating them to knitpack format. To do this, run::

  bzr check

If that reports a problem, run this command::

  bzr reconcile

Note that this can take many hours for repositories with deep history so
be sure to set aside some time for this if it is required.

Creating a new knitpack branch
------------------------------

If you're starting a project from scratch, it's easy to make it a
``knitpack`` one. Here's how::

  cd my-stuff
  bzr init --pack-0.92
  bzr add
  bzr commit -m "initial import"

In other words, use the normal sequence of commands but add the
``--pack-0.92`` option to the ``init`` command.

**Note:** In bzr 0.92, this format was called ``knitpack-experimental``.

Creating a new knitpack repository
----------------------------------

If you're starting a project from scratch and wish to use a shared
repository for branches, you can make it a ``knitpack`` repository like
this::

  cd my-repo
  bzr init-shared-repo --pack-0.92 .
  cd my-stuff
  bzr init
  bzr add
  bzr commit -m "initial import"

In other words, use the normal sequence of commands but add the
``--pack-0.92`` option to the ``init-shared-repo`` command.

Upgrading an existing branch or repository to knitpack format
-------------------------------------------------------------

If you have an existing branch and wish to migrate it to a ``knitpack``
format, use the ``upgrade`` command like this::

  bzr upgrade --pack-0.92 path-to-my-branch

If you are using a shared repository, run::

  bzr upgrade --pack-0.92 ROOT_OF_REPOSITORY

to upgrade the history database. Note that this will not alter the branch
format of each branch, so you will need to also upgrade each branch
individually if you are upgrading from an old (e.g. < 0.17) bzr. More
modern versions of bzr will already use our latest branch format, which
adds support for tags.

Starting a new knitpack branch from one in an older format
----------------------------------------------------------

This can be done in one of several ways:

1. Create a new branch and pull into it
2. Create a standalone branch and upgrade its format
3. Create a knitpack shared repository and branch into it

Here are the commands for using the ``pull`` approach::

  bzr init --pack-0.92 my-new-branch
  cd my-new-branch
  bzr pull my-source-branch

Here are the commands for using the ``upgrade`` approach::

  bzr branch my-source-branch my-new-branch
  cd my-new-branch
  bzr upgrade --pack-0.92 .

Here are the commands for the shared repository approach::

  cd my-repo
  bzr init-shared-repo --pack-0.92 .
  bzr branch my-source-branch my-new-branch
  cd my-new-branch

As a reminder, any of the above approaches can fail if the source branch
has inconsistent data within it and hasn't been reconciled yet. Please
be sure to check that before reporting problems.

Testing packs for bzr-svn users
-------------------------------

If you are using ``bzr-svn`` or are testing the prototype subtree support,
you can still use and assist in testing KnitPacks. The commands to use
are identical to the ones given above except that the name of the format
to use is ``knitpack-subtree-experimental``.

WARNING: Note that the subtree formats, ``dirstate-subtree`` and
``knitpack-subtree-experimental``, are **not** production strength yet and
may cause unexpected problems. They are required for the bzr-svn
plug-in but should otherwise only be used by people happy to live on the
bleeding edge. If you are using bzr-svn, you're on the bleeding edge
anyway. :-)

Reporting problems
------------------

If you need any help or encounter any problems, please contact the
developers via the usual ways, i.e. chat to us on IRC or send a message to
our mailing list. See https://www.breezy-vcs.org/pages/support.html for
contact details.

Technical notes
===============

Bazaar 0.92 adds a new format (experimental at first) implemented in
``breezy.repofmt.pack_repo.py``.

This format provides a knit-like interface which is quite compatible
with knit format repositories: you can get a VersionedFile for a
particular file-id, or for revisions, or for the inventory, even though
these do not correspond to single files on disk.

The on-disk format is that the repository directory contains these
files and subdirectories:

==================== =============================================
packs/               completed readonly packs
indices/             indices for completed packs
upload/              temporary files for packs currently being
                     written
obsolete_packs/      packs that have been repacked and are no
                     longer normally needed
pack-names           index of all live packs
lock/                lockdir
==================== =============================================

Note that for consistency we always write "indices" not "indexes".

This is implemented on top of pack files, which are written once from
start to end, then left alone. A pack consists of a body file, plus
several index files.
There are four index files for each pack, which have
the same basename and an extension indicating the purpose of the index:

======== ========== ======================== ==========================
extn     Purpose    Key                      References
======== ========== ======================== ==========================
``.tix`` File texts ``file_id, revision_id`` per-file parents,
                                             compression basis
                                             per-file parents
``.six`` Signatures ``revision_id,``         -
``.rix`` Revisions  ``revision_id,``         revision parents
``.iix`` Inventory  ``revision_id,``         revision parents,
                                             compression base
======== ========== ======================== ==========================

Indices are accessed through the ``breezy.index.GraphIndex`` class.
Indices are stored as sorted files on disk. Each line is one record,
and contains:

* key fields
* a value string - for all these indices, this is an ascii decimal pair
  of "offset length" giving the position of the referenced data within
  the pack body file
* a list of zero or more reference lists

The reference lists let a graph be stored within the index. Each
reference list entry points to another entry in the same index. The
references are represented as a byte offset for the target within the
index file.

When a compression base is given, it indicates that the body of the text
or inventory is a forward delta from the referenced revision. The
compression base list must have length 0 or 1.

Like packs, indexes are written only once and then unmodified. A
GraphIndex builder is a mutable in-memory graph that can be sorted,
cross-referenced and written out when the write group completes.

There can also be index entries with a value of 'a' for absent. These
records exist just to be pointed to in a graph. This is used, for
example, to give the revision-parent pointer when the parent revision is
in a previous pack.

The data content for each record is a knit data chunk. The knits are
always unannotated - the annotations must be generated when needed.
(We'd like to cache/memoize the annotations.) The data hunks can be
moved between packs without needing to recompress them.

It is not possible to regenerate an index from the body file, because
the index contains information stored in the knit index that's not in the
body. (In particular, the per-file graph is only stored in the index.)
We would like to change this in a future format.

The lock is a regular LockDir lock. The lock is only held for a much
reduced scope, while updating the pack-names file. The bulk of the
insertion can be done without the repository locked. This is an
implementation detail; the repository user should still call
``repository.lock_write`` at the regular time but be aware this does not
correspond to a physical mutex.

Read locks control caching but do not affect writers.

The newly-added repository write group concept is very important to
KnitPack repositories. When ``start_write_group`` is called, a new
temporary pack is created and all modifications to the repository will
go into it until either ``commit_write_group`` or ``abort_write_group``
is called, at which time it is either finished and moved into place or
discarded respectively. Write groups cannot be nested, only one can be
underway at a time on a Repository instance and they must occur within a
write lock.

Normally the data for each revision will be entirely within a single
pack but this is not required.

When a pack is finished, it gets a final name based on the md5 of all the
data written into the pack body file.

The ``pack-names`` file gives the list of all finished non-obsolete
packs. (This should always be the same as the list of files in the
``packs/`` directory, but the file is needed for read-only HTTP clients
that can't easily list directories, and it includes other information.)
The constraint on the ``pack-names`` list is that every file mentioned
must exist in the ``packs/`` directory.
In rare cases, when a writer is interrupted, about-to-be-removed packs
may still be present in the directory but removed from the list.

As well as the list of names, the pack-names file also contains the
size, in bytes, of each of the four indices. This is used to bootstrap
bisection search within the indices.

In normal use, one pack will be created for each commit to a repository.
This would build up to an inefficient number of files over time, so a
``repack`` operation is available to recombine them, by producing larger
files containing data on multiple revisions. This can be done manually
by running ``bzr pack``, and it also may happen automatically when a
write group is committed.

The repacking strategy used at the moment tries to balance not doing too
much work during commit with not having too many small files left in the
repository. The algorithm is roughly this: the total number of
revisions in the repository is expressed as a decimal number, e.g.
"532". Then we'll repack until we have five packs containing a hundred
revisions each, three packs containing ten revisions each, and two packs
with single revisions. This means that each revision will normally
initially be created in a single-revision pack, then moved to a
ten-revision pack, then to a 100-pack, and so on.

As with other repositories, in normal use data is only inserted.
However, in some circumstances we may want to garbage-collect or prune
existing data, or reconcile indexes.

.. vim: tw=72 ft=rst expandtab

bzrformats_3.4.0.orig/doc/repository-stream.txt0000644000000000000000000001604315162203117016676 0ustar00

==================
Repository Streams
==================

Status
======

:Date: 2008-04-11

This document describes the proposed programming interface for streaming
data from and into repositories. This programming interface should allow
a single interface for pulling data from and inserting data into a Breezy
repository.

.. contents::

Motivation
==========

To eliminate the current requirement that extracting data from a
repository requires either using a slow format, or knowing the format of
both the source repository and the target repository.

Use Cases
=========

Here's a brief description of use cases this interface is intended to
support.

Fetch operations
----------------

We fetch data between repositories as part of push/pull/branch operations.
Fetching data is currently a very interactive process with lots of
requests. For performance having the data be supplied in a stream will
improve push and pull to remote servers. For purely local operations the
streaming logic should help reduce memory pressure. In fetch operations
we always know the formats of both the source and target.

Smart server operations
~~~~~~~~~~~~~~~~~~~~~~~

With the smart server we support one streaming format, but this is only
usable when both the client and server have the same model of data, and
requires non-optimal IO ordering for pack to pack operations. Ideally we
can both provide optimal IO ordering in the pack to pack case, and correct
ordering for pack to knits.

Bundles
-------

Bundles also create a stream of data for revisions from a repository.
Unlike fetch operations we do not know the format of the target at the
time the stream is created. It would be good to be able to treat bundles
as frozen branches and repositories, so a serialised stream should be
suitable for this.

Data conversion
---------------

At this point we are not trying to integrate data conversion into this
interface, though it is likely possible.

Characteristics
===============

Some key aspects of the described interface are discussed in this section.

Single round trip
-----------------

All users of this should be able to create an appropriate stream from a
single round trip.

Forward-only reads
------------------

There should be no need to seek in a stream when inserting data from it
into a repository.
This places an ordering constraint on streams which some
repositories do not need.

Serialisation
=============

At this point serialisation of a repository stream has not been specified.
Some considerations to bear in mind about serialisation are worth noting
however.

Weaves
------

While there shouldn't be too many users of weave repositories anymore,
avoiding pathological behaviour when a weave is being read is a good idea.
Having the weave itself embedded in the stream is very straightforward
and does not need expensive on the fly extraction and re-diffing to take
place.

Bundles
-------

Being able to perform random reads from a repository stream which is a
bundle would allow stacking a bundle and a real repository together. This
will need the pack container format to be used in such a way that we can
avoid reading more data than needed within the pack container's readv
interface.

Specification
=============

This describes the interface for requesting a stream, and the programming
interface a stream must provide. Streams that have been serialised should
expose the same interface.

Requesting a stream
-------------------

To request a stream, three parameters are needed:

* A revision search to select the revisions to include.
* A data ordering flag. There are two values for this - 'unordered' and
  'topological'. 'unordered' streams are useful when inserting into
  repositories that have the ability to perform atomic insertions.
  'topological' streams are useful when converting data, or when
  inserting into repositories that cannot perform atomic insertions (such
  as knit or weave based repositories).
* A complete_inventory flag. When provided this flag signals the stream
  generator to include all the data needed to construct the inventory of
  each revision included in the stream, rather than just deltas. This is
  useful when converting data from a repository with a different
  inventory serialisation, as pure deltas would not be able to be
  reconstructed.
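
To illustrate what the 'topological' ordering value asks for, here is a
small sketch that orders revisions so that every parent comes before its
children. It assumes the usual meaning of a topological sort, uses Python's
standard ``graphlib`` module (Python 3.9 or later) and a made-up parent
map; it is not the ordering code used by any repository format.

```python
from graphlib import TopologicalSorter


def topological_order(parent_map):
    """Return revision ids so that every parent precedes its children.

    ``parent_map`` maps a revision id to the tuple of its parent ids.
    """
    # TopologicalSorter expects {node: predecessors}, which is exactly the
    # shape of a parent map.
    return list(TopologicalSorter(parent_map).static_order())


# Hypothetical history: rev-3 merges rev-2a and rev-2b, both based on rev-1.
parent_map = {
    "rev-1": (),
    "rev-2a": ("rev-1",),
    "rev-2b": ("rev-1",),
    "rev-3": ("rev-2a", "rev-2b"),
}
order = topological_order(parent_map)
```

An 'unordered' stream carries no such guarantee, which is why it suits
targets that can insert everything atomically and resolve references
afterwards.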

Structure of a stream
---------------------

A stream is an object. It can be consistency checked via the ``check``
method (which consumes the stream). The ``iter_contents`` method can be
used to iterate the contents of the stream. The contents of the stream are
a series of top level records, each of which contains one or more
bytestrings (potentially as a delta against another item in the
repository) and some optional metadata.

Consuming a stream
------------------

To consume a stream, obtain an iterator from the stream's
``iter_contents`` method. This iterator will yield the top level records.
Each record has two attributes. One is ``key_prefix`` which is a tuple key
prefix for the names of each of the bytestrings in the record. The other
attribute is ``entries``, an iterator of the individual items in the
record. Each item that the iterator yields is a factory which has metadata
about the entry and the ability to return the compressed bytes. This
factory can be decorated to allow obtaining different representations (for
example from a compressed knit fulltext to a plain fulltext).

In pseudocode::

  stream = repository.get_repository_stream(search, UNORDERED, False)
  for record in stream.iter_contents():
      for factory in record.entries:
          compression = factory.storage_kind
          print("Object %s, compression type %s, %d bytes long." % (
              record.key_prefix + factory.key,
              compression, len(factory.get_bytes_as(compression))))

This structure should allow stream adapters to be written which can coerce
all records to the type of compression that a particular client needs. For
instance, inserting into weaves requires fulltexts, so a stream would be
adapted for weaves by an adapter that takes a stream, and the target
weave, and then uses the target weave to reconstruct full texts (which is
all that the weave inserter would ask for).
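
A rough sketch of such an adapter follows. It relies only on the factory
attributes used in the pseudocode above (``storage_kind``, ``key``,
``get_bytes_as``) plus ``parents``; ``FulltextFactory``,
``adapt_to_fulltext``, ``FakeFactory`` and the ``reconstruct`` callable are
hypothetical names invented for this illustration and are not part of the
proposed interface.

```python
class FulltextFactory:
    """Hypothetical wrapper presenting any factory as a plain fulltext."""

    storage_kind = "fulltext"

    def __init__(self, factory, reconstruct):
        self._factory = factory
        self._reconstruct = reconstruct
        self.key = factory.key
        self.parents = factory.parents

    def get_bytes_as(self, storage_kind):
        if storage_kind != "fulltext":
            raise ValueError("only fulltext is offered: %r" % (storage_kind,))
        return self._reconstruct(self._factory)


def adapt_to_fulltext(entries, reconstruct):
    """Yield factories from ``entries`` that all answer as fulltexts.

    ``reconstruct`` is supplied by the caller; for a weave target it would
    consult the target weave to rebuild the full text of a delta record.
    """
    for factory in entries:
        if factory.storage_kind == "fulltext":
            yield factory
        else:
            yield FulltextFactory(factory, reconstruct)


class FakeFactory:
    """Stand-in for a real record factory, for demonstration only."""

    def __init__(self, key, storage_kind, data, parents=()):
        self.key = key
        self.storage_kind = storage_kind
        self.parents = parents
        self._data = data

    def get_bytes_as(self, storage_kind):
        if storage_kind != self.storage_kind:
            raise ValueError("cannot convert to %r" % (storage_kind,))
        return self._data


entries = [
    FakeFactory((b"rev-1",), "fulltext", b"hello\n"),
    FakeFactory((b"rev-2",), "knit-delta", b"<delta>", parents=((b"rev-1",),)),
]
# A real reconstruct would apply the delta; this one returns a canned text.
adapted = list(adapt_to_fulltext(entries, lambda factory: b"hello\nworld\n"))
kinds = [factory.storage_kind for factory in adapted]
texts = [factory.get_bytes_as("fulltext") for factory in adapted]
```

The consumer then only ever asks for ``fulltext``, regardless of how each
record was stored in the source.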
In a similar approach, a stream could internally delta compress many
fulltexts and be able to answer both fulltext and compressed record
requests without extra IO.

factory metadata
~~~~~~~~~~~~~~~~

Valid attributes on the factory are:

* sha1: Optional ascii representation of the sha1 of the bytestring (after
  delta reconstruction).
* storage_kind: Required kind of storage compression that has been used
  on the bytestring. One of ``mpdiff``, ``knit-annotated-ft``,
  ``knit-annotated-delta``, ``knit-ft``, ``knit-delta``, ``fulltext``.
* parents: Required graph parents to associate with this bytestring.
* compressor_data: Required opaque data relevant to the storage_kind.
  (This is set to None when the compressor has no special state needed)
* key: The key for this bytestring. Like each parent this is a tuple that
  should have the key_prefix prepended to it to give the unified
  repository key name.

.. vim: ft=rst tw=74 ai

bzrformats_3.4.0.orig/doc/repository.txt0000644000000000000000000003714415162203117015412 0ustar00

============
Repositories
============

Status
======

:Date: 2007-07-08

This document describes the services repositories offer and need to offer
within breezy.

.. contents::

Motivation
==========

To provide clarity to API and performance tradeoff decisions by
centralising the requirements placed upon repositories.

Terminology
===========

A **repository** is a store of historical data for Breezy.

Command Requirements
====================

==================  ====================
Command             Needed services
==================  ====================
Add                 None
Annotate            Annotated file texts, revision details
Branch              Fetch, Revision parents, Inventory contents, All file
                    texts
Bundle              Maximally compact diffs (file and inventory), Revision
                    graph difference, Revision texts.
Commit              Insert new texts, insert new inventory via delta,
                    insert revision, insert signature
Fetching            Revision graph difference, ghost identification, stream
                    data introduced by a set of revisions in some cheap
                    form, insert data from a stream, validate data during
                    insertion.
Garbage Collection  Exclusive lock the repository preventing readers.
Revert              Delta from working tree to historical tree, and then
                    arbitrary file access to obtain the texts of differing
                    files.
Uncommit            Revision graph access.
Status              Revision graph access, revision text access, file
                    fingerprint information, inventory differencing.
Diff                As status but also file text access.
Merge               As diff but needs up to twice as many file texts -
                    base and other for each changed file. Also an initial
                    fetch is needed.
Log                 Revision graph (entire at the moment) access,
                    sometimes status between adjacent revisions. Log of a
                    file needs per-file-graph. Dominator caching or
                    similar tools may be needed to prevent entire graph
                    access.
Missing             Revision graph access, and revision texts to show
                    output.
Update              As for merge, but twice.
==================  ====================

Data access patterns
====================

Ideally we can make our data access for commands such as branch dovetail
well with the native storage in the repository, in the common case. Doing
this may require choosing the behaviour of some commands to allow us to
have a smaller range of access patterns which we can optimise more heavily.
Alternatively if each command is very predictable in its data access
pattern we may be able to hint to the low level layers which pattern is
needed on a per command basis to get efficient behaviour.

===================  ===================================================
Command              Data access pattern
===================  ===================================================
Annotate-cached      Find text name in an inventory, Recreate one text,
                     recreate annotation regions
Annotate-on demand   Find file id from name, then breadth-first pre-order
                     traversal of versions-of-the-file until the annotation
                     is complete.
Branch               Fetch, possibly taking a copy of any file present in a
                     nominated revision when it is validated during fetch.
Bundle               Revision-graph as for fetch; then inventories for
                     selected revision_ids to determine file texts, then
                     mp-parent deltas for all determined file texts.
Commit               Something like basis-inventories read to determine
                     per-file graphs, insertion of new texts (which may
                     be delta compressed), generation of annotation
                     regions if the repository is configured to do so,
                     finalisation of the inventory pointing at all the new
                     texts and finally a revision and possibly signature.
Fetching             Revision-graph searching to find the graph difference.
                     Scan the inventory data introduced during the selected
                     revisions, and grab the on disk data for the found
                     file texts, annotation region data, per-file-graph
                     data, piling all this into a stream.
Garbage Collection   Basically a mass fetch of all the revisions which
                     branches point at, then a bait and switch with the old
                     repository thus removing unreferenced data.
Revert               Revision graph access for the revision being reverted
                     to, inventory extraction of that revision,
                     dirblock-order file text extract for files that were
                     different.
Uncommit             Revision graph access to synthesise pending-merges
                     linear access down left-hand-side, with is_ancestor
                     checks between all the found non-left-hand-side
                     parents.
Status               Lookup the revisions added by pending merges and their
                     commit messages. Then an inventory difference between
                     the trees involved, which may include a working tree.
                     If there is a working tree involved then the file
                     fingerprint for cache-misses on files will be needed.
                     Note that dirstate caches most of this making
                     repository performance largely irrelevant: but if it
                     was fast enough dirstate might be able to be simpler.
Diff                 As status but also file text access for every file
                     that is different - either one text (working tree
                     diff) or a diff of two (revision to revision diff).
Merge                As diff but needs up to twice as many file texts -
                     base and other for each changed file. Also an initial
                     fetch is needed. Note that the access pattern is
                     probably id-based at the moment, but that may be
                     'fixed' with the iter_changes based merge. Also note
                     that while the texts from OTHER are the ones accessed,
                     this is equivalent to the **newest** form of each text
                     changed from BASE to OTHER. And as the repository
                     looks at when data is introduced, this should be the
                     pattern we focus on for merge.
Log                  Revision graph (entire at the moment) access, log of a
                     file wants a per-file-graph. Log -v will want
                     newest-first inventory deltas between revisions.
Missing              Revision graph access, breadth-first pre-order.
Update               As for merge, but twice.
===================  ===================================================

Patterns used
-------------

Note that these are able to be changed by changing what we store. For
instance if the repository satisfies mpdiff requests, then bundle can be
defined in terms of mpdiff lookups rather than file text lookups
appropriate to create mpdiffs. If the repository satisfies full text
requests only, then you need the topological access to build up the
desired mpdiffs.

===========================================  =========
Pattern                                      Commands
===========================================  =========
Single file text                             annotate, diff
Files present in one revision                branch
Newest form of files altered by revisions    merge, update?
Topological access to file versions/deltas   annotate-uncached
Stream all data required to recreate revs    branch (lightweight)
Stream file texts in topological order       bundle
Write full versions of files, inv, rev, sig  commit
Write deltas of files, inv for one tree      commit
Stream all data introduced by revs           fetch
Regenerate/combine deltas of many trees      fetch, pack
Reconstruct all texts and validate trees     check, fetch
Revision graph walk                          fetch, pack, uncommit,
                                             annotate-uncached,
                                             merge, log, missing
Top down access multiple invs concurrently   status, diff, merge?, update?
Concurrent access to N file texts            diff, merge
Iteration of inventory deltas                log -v, fetch?
===========================================  =========

Facilities to scale well
========================

Indices
-------

We want < linear access to all data in the repository. This suggests
everything is indexed to some degree.

Often we know the kind of data we are accessing; which allows us to
partition our indices if that will help (e.g. by reducing the total index
size for queries that only care about the revision graph).

Indices that support our data access patterns will usually display
increased locality of reference, reducing the impact of large indices
without needing careful page size management or other tricks.

We need repository wide indices. For the current repositories this is
achieved by dividing the keyspace (revisions, signatures, inventories,
per-fileid) and then having an append only index within each keyspace.
For pack based repositories we will want some means to query the index of
each component pack, presumably as a single logical index.

It would be nice if indexing was made cleanly separate from storage. So
that suggests indices don't know the meaning of the lookup; indices which
offer particular ordering, or graph walking facilities will clearly need
that information, but perhaps they don't need to know the semantics?

Index size
~~~~~~~~~~

Smaller indexes are good.
We could go with one big index, or a different
index for different operation styles. As multiple indices will occupy more
space in total we should think carefully before adding indices.

Index ordering
~~~~~~~~~~~~~~

Looking at the data access patterns some operations such as graph walking
can clearly be made more efficient by offering direct iteration rather
than repeated reentry into the index - so having indices that support
iteration in such a style would be useful eventually.

Changing our current indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can consider introducing cleaner indices in advance of a full pack
based repository.

There are many possibilities for this, but I've chosen one that seems ok
to me for illustration.

A key element is to consider when indices are updated. I think that the
update style proposed for pack based repositories - write once, then when
we group data again rewrite a new single index - is sufficient.

Replace .kndx
^^^^^^^^^^^^^

We could discard the per-knit .kndx by writing a new index at the end of
every Breezy transaction indexing the new data introduced by the Breezy
operation. e.g. at the end of fetch. This can be based on the new
``GraphIndex`` index type.

Encoding a knit entry into a ``GraphIndex`` can be done as follows:

* Change the key to include a prefix of the knit name, to allow filtering
  out of data from different knits.
* Encode the parents from the knit as the zeroth node reference list.
* If the knit hunk was delta compressed encode the node it was delta
  compressed against as the 1st node reference list (otherwise the 1st
  node reference list will be empty to indicate no compression parents).
* For the value encode similarly to the current knit format the byte
  offset for the data record in the knit, the byte length for the data
  record in the knit and the no-end-of-line flag.
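
The encoding steps above can be sketched as a small helper that builds
the key, value and reference lists for one knit record. This is only an
illustration: ``encode_knit_entry`` is a hypothetical function, and the
exact byte layout chosen for the value (a one-byte flag followed by
"offset length") is an assumption of this sketch rather than something
specified in the text above.

```python
def encode_knit_entry(knit_name, version_id, parents, compression_parent,
                      offset, length, no_eol):
    """Return a (key, value, reference_lists) triple for one knit record."""
    # Prefix the key with the knit name so entries from different knits can
    # be told apart in a shared index.
    key = (knit_name, version_id)
    # Zeroth reference list: the graph parents recorded in the knit.
    parent_refs = tuple((knit_name, parent) for parent in parents)
    # First reference list: the compression parent, or empty when the hunk
    # is not delta compressed.
    if compression_parent is None:
        compression_refs = ()
    else:
        compression_refs = ((knit_name, compression_parent),)
    # Value: no-end-of-line flag, then byte offset and byte length of the
    # data record (layout assumed for this sketch).
    flag = b"N" if no_eol else b" "
    value = flag + b"%d %d" % (offset, length)
    return key, value, (parent_refs, compression_refs)


# A delta-compressed record for rev-2 of a hypothetical knit "file-id".
key, value, refs = encode_knit_entry(
    b"file-id", b"rev-2", [b"rev-1"], b"rev-1", 120, 45, False)
```

Keeping the two reference lists separate lets a reader walk the parent
graph without caring about compression, and lets a specialised index omit
the second list entirely where nothing is ever delta compressed.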

It's important to note that knit repositories cannot be regenerated by
scanning .knits, so a mapped index is still irreplaceable and must be
transmitted on push/pull.

A potential improvement exists by specialising this further to not record
data that is not needed - e.g. an index of revisions does not need to
support a pointer to a parent compressed text as revisions.knit is not
delta-compressed ever. Likewise signatures do not need the parent pointers
at all as there is no 'signature graph'.

Data
----

Moving to pack based repositories
---------------------------------

We have a number of challenges to solve.

Naming of files
~~~~~~~~~~~~~~~

As long as the file name is unique it does not really matter. It might be
interesting to have it be deterministic based on content, but there are no
specific problems we have solved by doing that, and doing so would require
hashing the full file. OTOH hashing the full file is a cheap way to detect
bit-errors in transfer (such as windows corruption). Non-reused file names
are required for data integrity, as clients having read an index will
readv at arbitrary times later.

Discovery of files
~~~~~~~~~~~~~~~~~~

With non-listable transports how should the collection of pack/index files
be found?

Initially record a list of all the pack/index files from write actions.
(Require writable transports to be listable). We can then use a heuristic
to statically combine pack/index files later.

Housing files
~~~~~~~~~~~~~

Combining indices on demand
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Merging data on push
~~~~~~~~~~~~~~~~~~~~

A trivial implementation would be to make a pack which has just the data
needed for the push, then send that. More sophisticated things would be
streaming single-pass creation, and also using this as an opportunity to
increase the packedness of the local repo.

Choosing compression/delta support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Caching and writing of data
===========================

Repositories try to provide a consistent view of the data within them
within a 'lock context'.

Locks
-----

Locks come in two flavours - read locks and write locks. Read locks allow
data to be read from the repository. Write locks allow data to be read and
signal that you intend to write data at some point. The actual writing of
data must take place within a Write Group.

Write locks provide a cache of repository data during the period of the
write lock, and allow write_groups to be acquired. For some repositories
the presence of a write lock is exclusive to a single client, for others
which are lock free or use server side locks (e.g. svn), the write lock
simply provides the cache context.

Write Groups
------------

Write groups are the only allowed means for inserting data into a
repository. These are created by ``start_write_group``, and concluded by
either ``commit_write_group`` or ``abort_write_group``. A write lock must
be held on the repository for the entire duration. At most one write
group can be active on a repository at a time.

Write groups signal to the repository the window during which data is
actively being inserted. Several write groups could be committed during a
single lock.

There is no guarantee that data inserted during a write group will be
invisible in the repository if the write group is not committed.
Specifically repositories without atomic insertion facilities will be
writing data as it is inserted within the write group, and may not be able
to revert that data - e.g. in the event of a dropped SFTP connection in a
knit repository, inserted file data will be visible in the repository.
Some repositories have an atomic insertion facility, and for those
all-or-nothing will apply.

The precise meaning of a write group is format specific.
For instance a knit based repository treats the write group methods as
dummy calls, simply meeting the api that clients will use. A pack based
repository will open a new pack container at the start of a write group,
and rename it into place at commit time.

.. vim: ft=rst tw=74 ai