Skip to content
  • Jeremy Evans's avatar
    e4f85bfc
    Implement Set as a core class · e4f85bfc
    Jeremy Evans authored
    
    
    Set has been an autoloaded standard library since Ruby 3.2.
    The standard library Set is less efficient than it could be, as it
    uses Hash for storage, which stores unnecessary values for each key.
    
    Implementation details:
    
    * Core Set uses a modified version of `st_table`, named `set_table`.
      than `s/st_/set_/`, the main difference is that the stored records
      do not have values, making them 1/3 smaller. `st_table_entry` stores
      `hash`, `key`, and `record` (value), while `set_table_entry` only
      stores `hash` and `key`.  This results in large sets using ~33% less
      memory compared to stdlib Set.  For small sets, core Set uses 12% more
      memory (160 byte object slot and 64 malloc bytes, while stdlib set
      uses 40 for Set and 160 for Hash).  More memory is used because
      the set_table is embedded and 72 bytes in the object slot are
      currently wasted. Hopefully we can make this more efficient and have
      it stored in an 80 byte object slot in the future.
    
    * All methods are implemented as cfuncs, except the pretty_print
      methods, which were moved to `lib/pp.rb` (which is where the
      pretty_print methods for other core classes are defined).  As is
      typical for core classes, internal calls call C functions and
      not Ruby methods.  For example, to check if something is a Set,
      `rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the
      related object.
    
    * Almost all methods use the same algorithm that the pure-Ruby
      implementation used.  The exception is when calling `Set#divide` with a
      block with 2-arity.  The pure-Ruby method used tsort to implement this.
      I developed an algorithm that only allocates a single intermediate
      hash and does not need tsort.
    
    * The `flatten_merge` protected method is no longer necessary, so it
      is not implemented (it could be).
    
    * Similar to Hash/Array, subclasses of Set are no longer reflected in
      `inspect` output.
    
    * RDoc from stdlib Set was moved to core Set, with minor updates.
    
    This includes a comprehensive benchmark suite for all public Set
    methods.  As you would expect, the native version is faster in the
    vast majority of cases, and multiple times faster in many cases.
    There are a few cases where it is significantly slower:
    
    * Set.new with no arguments (~1.6x)
    * Set#compare_by_identity for small sets (~1.3x)
    * Set#clone for small sets (~1.5x)
    * Set#dup for small sets (~1.7x)
    
    These are slower as Set does not currently use the AR table
    optimization that Hash does, so a new set_table is initialized for
    each call.  I'm not sure it's worth the complexity to have an AR
    table-like optimization for small sets (for hashes it makes sense,
    as small hashes are used everywhere in Ruby).
    
    The rbs and repl_type_completor bundled gems will need updates to
    support core Set.  The pull request marks them as allowed failures.
    
    This passes all set tests with no changes.  The following specs
    needed modification:
    
    * Modifying frozen set error message (changed for the better)
    * `Set#divide` when passed a 2-arity block no longer yields the same
      object as both the first and second argument (this seems like an issue
      with the previous implementation).
    * Set-like objects that override `is_a?` such that `is_a?(Set)` return
      `true` are no longer treated as Set instances.
    * `Set.allocate.hash` is no longer the same as `nil.hash`
    * `Set#join` no longer calls `Set#to_a` (it calls the underlying C
       function).
    * `Set#flatten_merge` protected method is not implemented.
    
    Previously, `set.rb` added a `SortedSet` autoload, which loads
    `set/sorted_set.rb`.  This replaces the `Set` autoload in `prelude.rb`
    with a `SortedSet` autoload, but I recommend removing it and
    `set/sorted_set.rb`.
    
    This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`,
    reflecting that switch to a core class.  This does not move the spec
    files, as I'm not sure how they should be handled.
    
    Internally, this uses the st_* types and functions as much as
    possible, and only adds set_* types and functions as needed.
    The underlying set_table implementation is stored in st.c, but
    there is no public C-API for it, nor is there one planned, in
    order to keep the ability to change the internals going forward.
    
    For internal uses of st_table with Qtrue values, those can
    probably be replaced with set_table.  To do that, include
    internal/set_table.h.  To handle symbol visibility (rb_ prefix),
    internal/set_table.h uses the same macro approach that
    include/ruby/st.h uses.
    
    The Set class (rb_cSet) and all methods are defined in set.c.
    There isn't currently a C-API for the Set class, though C-API
    functions can be added as needed going forward.
    
    Implements [Feature #21216]
    
    Co-authored-by: default avatarJean Boussier <jean.boussier@gmail.com>
    Co-authored-by: default avatarOliver Nutter <mrnoname1000@riseup.net>
    e4f85bfc
    Implement Set as a core class
    Jeremy Evans authored
    
    
    Set has been an autoloaded standard library since Ruby 3.2.
    The standard library Set is less efficient than it could be, as it
    uses Hash for storage, which stores unnecessary values for each key.
    
    Implementation details:
    
    * Core Set uses a modified version of `st_table`, named `set_table`.
      than `s/st_/set_/`, the main difference is that the stored records
      do not have values, making them 1/3 smaller. `st_table_entry` stores
      `hash`, `key`, and `record` (value), while `set_table_entry` only
      stores `hash` and `key`.  This results in large sets using ~33% less
      memory compared to stdlib Set.  For small sets, core Set uses 12% more
      memory (160 byte object slot and 64 malloc bytes, while stdlib set
      uses 40 for Set and 160 for Hash).  More memory is used because
      the set_table is embedded and 72 bytes in the object slot are
      currently wasted. Hopefully we can make this more efficient and have
      it stored in an 80 byte object slot in the future.
    
    * All methods are implemented as cfuncs, except the pretty_print
      methods, which were moved to `lib/pp.rb` (which is where the
      pretty_print methods for other core classes are defined).  As is
      typical for core classes, internal calls call C functions and
      not Ruby methods.  For example, to check if something is a Set,
      `rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the
      related object.
    
    * Almost all methods use the same algorithm that the pure-Ruby
      implementation used.  The exception is when calling `Set#divide` with a
      block with 2-arity.  The pure-Ruby method used tsort to implement this.
      I developed an algorithm that only allocates a single intermediate
      hash and does not need tsort.
    
    * The `flatten_merge` protected method is no longer necessary, so it
      is not implemented (it could be).
    
    * Similar to Hash/Array, subclasses of Set are no longer reflected in
      `inspect` output.
    
    * RDoc from stdlib Set was moved to core Set, with minor updates.
    
    This includes a comprehensive benchmark suite for all public Set
    methods.  As you would expect, the native version is faster in the
    vast majority of cases, and multiple times faster in many cases.
    There are a few cases where it is significantly slower:
    
    * Set.new with no arguments (~1.6x)
    * Set#compare_by_identity for small sets (~1.3x)
    * Set#clone for small sets (~1.5x)
    * Set#dup for small sets (~1.7x)
    
    These are slower as Set does not currently use the AR table
    optimization that Hash does, so a new set_table is initialized for
    each call.  I'm not sure it's worth the complexity to have an AR
    table-like optimization for small sets (for hashes it makes sense,
    as small hashes are used everywhere in Ruby).
    
    The rbs and repl_type_completor bundled gems will need updates to
    support core Set.  The pull request marks them as allowed failures.
    
    This passes all set tests with no changes.  The following specs
    needed modification:
    
    * Modifying frozen set error message (changed for the better)
    * `Set#divide` when passed a 2-arity block no longer yields the same
      object as both the first and second argument (this seems like an issue
      with the previous implementation).
    * Set-like objects that override `is_a?` such that `is_a?(Set)` return
      `true` are no longer treated as Set instances.
    * `Set.allocate.hash` is no longer the same as `nil.hash`
    * `Set#join` no longer calls `Set#to_a` (it calls the underlying C
       function).
    * `Set#flatten_merge` protected method is not implemented.
    
    Previously, `set.rb` added a `SortedSet` autoload, which loads
    `set/sorted_set.rb`.  This replaces the `Set` autoload in `prelude.rb`
    with a `SortedSet` autoload, but I recommend removing it and
    `set/sorted_set.rb`.
    
    This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`,
    reflecting that switch to a core class.  This does not move the spec
    files, as I'm not sure how they should be handled.
    
    Internally, this uses the st_* types and functions as much as
    possible, and only adds set_* types and functions as needed.
    The underlying set_table implementation is stored in st.c, but
    there is no public C-API for it, nor is there one planned, in
    order to keep the ability to change the internals going forward.
    
    For internal uses of st_table with Qtrue values, those can
    probably be replaced with set_table.  To do that, include
    internal/set_table.h.  To handle symbol visibility (rb_ prefix),
    internal/set_table.h uses the same macro approach that
    include/ruby/st.h uses.
    
    The Set class (rb_cSet) and all methods are defined in set.c.
    There isn't currently a C-API for the Set class, though C-API
    functions can be added as needed going forward.
    
    Implements [Feature #21216]
    
    Co-authored-by: default avatarJean Boussier <jean.boussier@gmail.com>
    Co-authored-by: default avatarOliver Nutter <mrnoname1000@riseup.net>
Loading