Discussion:
The Inbox: Collections-cbc.812.mcz
c***@source.squeak.org
2018-11-16 06:06:34 UTC
A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-cbc.812.mcz

==================== Summary ====================

Name: Collections-cbc.812
Author: cbc
Time: 15 November 2018, 10:06:05.244237 pm
UUID: 1977358c-ec82-5442-8f1d-027d4bc817a3
Ancestors: Collections-eem.811

Make intervals that are #= have the same hash.
Also, slight adjustment to #=.

=============== Diff against Collections-eem.811 ===============

Item was changed:
----- Method: Interval>>= (in category 'comparing') -----
= anObject
-
	^ self == anObject
+		or: [anObject isInterval
+			ifFalse: [super = anObject]
+			ifTrue:
+				[start = anObject first
+				and: [step = anObject increment
+				and: [self last = anObject last]]]]!
-		ifTrue: [true]
-		ifFalse: [anObject isInterval
-			ifTrue: [start = anObject first
-				and: [step = anObject increment
-				and: [self last = anObject last]]]
-			ifFalse: [super = anObject]]!

Item was changed:
----- Method: Interval>>hash (in category 'comparing') -----
hash
	"Hash is reimplemented because = is implemented."

	^(((start hash bitShift: 2)
+		bitOr: self last hash)
-		bitOr: stop hash)
		bitShift: 1)
		bitOr: self size!
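The bug being fixed can be modeled outside Smalltalk: two intervals that are #= (same first element, increment, and last element) could previously answer different hashes, because the old #hash mixed in stop while #= compares last. A rough Python sketch (this Interval model is a simplification for illustration, not Squeak's actual class):

```python
# Hypothetical model of Squeak's Interval, just enough to show the bug:
# equality compares (start, step, last), but the old hash mixed in `stop`.
class Interval:
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def size(self):
        return max(0, (self.stop - self.start) // self.step + 1)

    def last(self):
        return self.start + (self.size() - 1) * self.step

    def __eq__(self, other):
        return (self.start == other.start and self.step == other.step
                and self.last() == other.last())

    def old_hash(self):
        # pre-change: uses stop, which #= does not compare
        return (((self.start << 2) | self.stop) << 1) | self.size()

    def new_hash(self):
        # Collections-cbc.812: uses last, matching #=
        return (((self.start << 2) | self.last()) << 1) | self.size()

a = Interval(1, 9, 2)    # elements 1, 3, 5, 7, 9
b = Interval(1, 10, 2)   # also 1, 3, 5, 7, 9
print(a == b)                        # True
print(a.old_hash() == b.old_hash())  # False: stop differs (9 vs 10)
print(a.new_hash() == b.new_hash())  # True
```

Equal objects with unequal hashes silently break Set and Dictionary lookups, which is why the hash had to follow #=.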
Levente Uzonyi
2018-11-16 10:42:11 UTC
Post by c***@source.squeak.org
http://source.squeak.org/inbox/Collections-cbc.812.mcz
----- Method: Interval>>hash (in category 'comparing') -----
hash
	"Hash is reimplemented because = is implemented."
	^(((start hash bitShift: 2)
+		bitOr: self last hash)
-		bitOr: stop hash)
		bitShift: 1)
		bitOr: self size!
As Bert pointed out, #bitOr: is not a good choice for hashing, because
it loses entropy. Also, I'd use #hashMultiply instead of ad-hoc bit
shifts to maximize the entropy of the hash value. E.g.:

hash

	^((start hash hashMultiply bitXor: self last hash) hashMultiply
		bitXor: self size) hashMultiply
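The entropy argument is easy to demonstrate outside Smalltalk: OR can only set bits, never clear them, so OR-combined hashes pile up on values with many one-bits and collide heavily. A small Python comparison, where the multiply-and-mask stands in for Squeak's #hashMultiply (commonly described as multiplying by 1664525 and keeping the low 28 bits):

```python
# Compare how well two combining functions spread all pairs of 8-bit inputs.

def or_combine(a, b):
    # analogue of #bitOr: - can only set bits, so results skew toward
    # "all ones" and collide
    return a | b

def xor_multiply_combine(a, b):
    # rough analogue of #hashMultiply (multiply by 1664525, keep the low
    # 28 bits) followed by #bitXor:
    return ((a * 1664525) & 0xFFFFFFF) ^ b

pairs = [(a, b) for a in range(256) for b in range(256)]
or_outputs = {or_combine(a, b) for a, b in pairs}
mixed_outputs = {xor_multiply_combine(a, b) for a, b in pairs}

print(len(or_outputs))     # 256: OR of two bytes is still just a byte
print(len(mixed_outputs))  # far more distinct values from the same inputs
```

The OR version cannot produce more distinct hashes than one input alone, which matches Bert's and Levente's objection.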

Below is a snippet comparing various hash functions:

hashes := Set new. "Original #hash"
hashes2 := Set new. "#hash from Collections-cbc.812"
hashes3 := Set new. "The above proposed #hash using #hashMultiply and #bitXor:"
intervals := Set new. "The maximum number of different #hash values"

Interval allInstancesDo: [ :each |
	hashes add: each hash.
	hashes2 add: each hash2.
	hashes3 add: each hash3.
	intervals add: each ].
{
	hashes size.
	hashes2 size.
	hashes3 size.
	intervals size }

"==> #(584 584 1113 1113)"

The sample is too small, but it suggests that the original #hash and the
#hash in Collections-cbc.812 are suboptimal.
If we create a few more intervals in a workspace:

newIntervals := (1 to: 10000) collect: [ :each | (1 to: each) ].
newIntervals2 := (1 to: 10000) collect: [ :each | (1 to: each by: 2) ].
newIntervals3 := (1 to: 10000) collect: [ :each | (each to: 1 by: -1) ].

then the numbers will be markedly different:

"==> #(6122 5368 26244 27043)"
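For readers outside Squeak, the comparison can be approximated in Python: model SmallInteger hash as the integer itself and #hashMultiply as multiply-by-1664525 masked to 28 bits, then count distinct hashes over the same three interval families. Exact counts will differ from the in-image measurement, which also included pre-existing instances, but the ranking should hold:

```python
def hash_multiply(x):
    # model of Squeak's #hashMultiply: multiply by 1664525, keep 28 bits
    return (x * 1664525) & 0xFFFFFFF

# Build the three families from the workspace snippet as (start, stop, step),
# deriving size and last the way Interval does.
intervals = []
for n in range(1, 10001):
    for start, stop, step in ((1, n, 1), (1, n, 2), (n, 1, -1)):
        size = max(0, (stop - start) // step + 1)
        last = start + (size - 1) * step
        intervals.append((start, stop, size, last))

# SmallInteger hash is modeled as the integer itself.
old = {(((s << 2) | stop) << 1) | size          # original #hash (uses stop)
       for s, stop, size, last in intervals}
cbc = {(((s << 2) | last) << 1) | size          # Collections-cbc.812 (uses last)
       for s, stop, size, last in intervals}
mix = {hash_multiply(hash_multiply(hash_multiply(s) ^ last) ^ size)
       for s, stop, size, last in intervals}    # hashMultiply/bitXor proposal

print(len(old), len(cbc), len(mix))  # the hashMultiply variant disperses best
```

The OR-based variants saturate low bits and collide across families; the multiplied-and-XORed variant stays close to the number of genuinely distinct intervals.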
Chris Muller
2018-11-16 22:01:17 UTC
Post by Levente Uzonyi
#(6122 5368 26244 27043)
That's a pretty dramatic improvement in the hash dispersion.

On one hand, we were just trying to fix the discrepancy with #=, not
actually improve the #hash. But, since we're in here anyway....

It would be a disruption if someone has them in a HashedCollection,
but probably minor since they can rehash, after which they should
enjoy better performance.

I do keep some large MagmaDictionaries which rely on the standard
#hash, but don't allow enumeration (due to their size), and so can't
really be rehashed except by rebuilding them. But, if I have any
Intervals in them, I can probably deal with it.

So my guess is this is probably a worthwhile improvement. I'll go
along with whatever y'all decide, but if it's Levente's, please don't
forget to reparent to the trunk version. :) Much appreciated!

Best,
Chris
Chris Cunningham
2018-11-16 22:59:17 UTC
Chris,
If/when we implement this, the idea is to rehash the collection using:
HashedCollection rehashAll
Will this adversely affect your MagmaDictionaries? Or will those be handled
some other way?

-cbc
Post by Chris Muller
#(6122 5368 26244 27043)
That's a pretty dramatic improvement in the hash dispersion.
On one hand, we were just trying to fix the discrepancy with #=, not
actually improve the #hash. But, since we're in here anyway....
It would be a disruption if someone has them in a HashedCollection,
but probably minor since they can rehash, after which they should
enjoy better performance.
I do keep some large MagmaDictionaries which rely on the standard
#hash, but don't allow enumeration (due to their size), and so can't
really be rehashed except by rebuilding them. But, if I have any
Intervals in them, I can probably deal with it.
So my guess is this is probably a worthwhile improvement. I'll go
along with whatever y'all decide, but if it's Levente's, please don't
forget to reparent to the trunk version. :) Much appreciated!
Best,
Chris
Chris Muller
2018-11-16 23:11:01 UTC
Post by Chris Cunningham
Chris,
HashedCollection rehashAll
Will this adversely affect your MagmaDictionaries?
No, but it won't help them at all, either.

Does anyone know if ReferenceStream automatically rehashes Dictionaries
and Sets when it materializes them? If not, the impact of
changing the hash calculation is much higher than if it does.

Magma does, so no action should be needed for regular Sets and Dictionaries.
Post by Chris Cunningham
Or will those be handled some other way?
However, Magma has another special dictionary called a MagmaDictionary
which is designed to be larger than RAM. It maintains a reference to
its 'session', which communicates with the server to access the large
dictionary contents straight from the disk. If I have Intervals in
any of these, I'll have to manually plan to rebuild them from scratch
with a utility script(s), because they don't support rehashing the way
small, in-memory HashedCollections do.

- Chris
Eliot Miranda
2018-11-19 03:16:46 UTC
Hi Chris,
Post by Chris Muller
Post by Chris Cunningham
Chris,
HashedCollection rehashAll
Will this adversely affect your MagmaDictionaries?
No, but it won't help them at all, either.
Does anyone know if ReferenceStream automatically rehashes Dictionaries
and Sets when it materializes them? If not, the impact of
changing the hash calculation is much higher than if it does.
It has to, because Object>>#= and Object>>#hash fall back on #== and
#identityHash, so Dictionary, Set et al instances are potentially
affected by identity. Therefore, on reconstructing a Dictionary, Set et al
instance, an unpickler must rehash (*).


(*) unless it verifies that all elements implement their own #= and #hash,
which is intractable in practice; the only ways I can see of verifying that
an object does not use #== or #identityHash in its #= and #hash methods are

a) to analyze the code (impossible for a non-AI unpickler) or,
b) to construct a shallow copy of the object (since lots of #= implementations
short-cut via "^self == other or: [...") and simulate #= and #hash, to see
if #== or #identityHash is sent

Either of these would slow down unpickling enormously; rehashing invokes #=
and #hash anyway, but at full execution speed.
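Eliot's point, that a materialized hash table is silently broken until rehashed, can be sketched with a toy open-addressing set in Python (TinySet and the hash functions here are hypothetical, purely for illustration):

```python
# A minimal open-addressing set keyed by an external hash function,
# to show why changing the hash function without rehashing loses elements.
class TinySet:
    def __init__(self, hash_fn, capacity=8):
        self.hash_fn = hash_fn
        self.slots = [None] * capacity

    def add(self, item):
        i = self.hash_fn(item) % len(self.slots)
        while self.slots[i] is not None and self.slots[i] != item:
            i = (i + 1) % len(self.slots)   # linear probing
        self.slots[i] = item

    def includes(self, item):
        i = self.hash_fn(item) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i] == item:
                return True
            i = (i + 1) % len(self.slots)
        return False

    def rehash(self, new_hash_fn):
        # collect all elements and re-insert under the new hash function
        items = [s for s in self.slots if s is not None]
        self.slots = [None] * len(self.slots)
        self.hash_fn = new_hash_fn
        for item in items:
            self.add(item)

old_hash = lambda x: x * 3       # hypothetical "before" hash
new_hash = lambda x: x * 7 + 1   # hypothetical "after" hash

s = TinySet(old_hash)
s.add(5)
# Simulate materializing the structure into an image whose hash changed:
s.hash_fn = new_hash
print(s.includes(5))   # False: 5 still sits in the slot old_hash chose
s.rehash(new_hash)
print(s.includes(5))   # True after rehashing
```

This is exactly what HashedCollection rehashAll repairs in-image, and what an unpickler must do on reconstruction, since it cannot know which elements' hashes changed.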
Post by Chris Muller
Magma does, so no action should be needed for regular Sets and
Dictionary's.
Post by Chris Cunningham
Or will those be handled some other way?
However, Magma has another special dictionary called a MagmaDictionary
which is designed to be larger than RAM. It maintains a reference to
its 'session', which communicates with the server to access the large
dictionary contents straight from the disk. If I have Intervals in
any of these, I'll have to manually plan to rebuild them from scratch
with a utility script(s), because they don't support rehashing the way
small, in-memory HashedCollections do.
- Chris
--
_,,,^..^,,,_
best, Eliot