Discussion:
The Inbox: Kernel-cmm.1198.mcz
c***@source.squeak.org
0000-11-25 05:14:03 UTC
Chris Muller uploaded a new version of Kernel to project The Inbox:
http://source.squeak.org/inbox/Kernel-cmm.1198.mcz

==================== Summary ====================

Name: Kernel-cmm.1198
Author: cmm
Time: 23 November 2018, 11:12:47.414703 pm
UUID: fe228ca8-2ec7-4432-b3d9-76da98be4475
Ancestors: Kernel-eem.1197

- Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
- If so, then #xxxClass can be banished.
- With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".

=============== Diff against Kernel-eem.1197 ===============

Item was added:
+ ----- Method: Object>>basicClass (in category 'class membership') -----
+ basicClass
+ "Primitive. Answer the object which is the receiver's class. Essential. See
+ Object documentation whatIsAPrimitive."
+
+ <primitive: 111>
+ self primitiveFailed!

Item was changed:
----- Method: Object>>class (in category 'class membership') -----
class
+ "Answer the object which is the receiver's class. Essential."
- "Primitive. Answer the object which is the receiver's class. Essential. See
- Object documentation whatIsAPrimitive."

+ ^ self basicClass!
- <primitive: 111>
- self primitiveFailed!

Item was changed:
----- Method: Object>>storeDataOn: (in category 'objects from disk') -----
storeDataOn: aDataStream
"Store myself on a DataStream. Answer self. This is a low-level DataStream/ReferenceStream method. See also objectToStoreOnDataStream. NOTE: This method must send 'aDataStream beginInstance:size:' and then (nextPut:/nextPutWeak:) its subobjects. readDataFrom:size: reads back what we write here."
| cntInstVars cntIndexedVars |

cntInstVars := self class instSize.
cntIndexedVars := self basicSize.
aDataStream
+ beginInstance: self class
- beginInstance: self xxxClass
size: cntInstVars + cntIndexedVars.
1 to: cntInstVars do:
[:i | aDataStream nextPut: (self instVarAt: i)].

"Write fields of a variable length object. When writing to a dummy
stream, don't bother to write the bytes"
((aDataStream byteStream class == DummyStream) and: [self class isBits]) ifFalse: [
1 to: cntIndexedVars do:
[:i | aDataStream nextPut: (self basicAt: i)]].
!

Item was removed:
- ----- Method: Object>>xxxClass (in category 'class membership') -----
- xxxClass
- "For subclasses of nil, such as ObjectOut"
Levente Uzonyi
2018-11-24 13:39:39 UTC
Post by c***@source.squeak.org
- Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
It won't work while the special bytecode for #class is compiled. And even
after that, you have to recompile all senders of #class to make it use
the primitive and the new method instead of optimizing it away.
Post by c***@source.squeak.org
- If so, then #xxxClass can be banished.
- With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".
That won't work either for the same reason. And we do not want to remove
the bytecode, do we?

Levente
Chris Muller
2018-11-24 20:08:07 UTC
Post by Levente Uzonyi
Post by c***@source.squeak.org
- Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
It won't work while the special bytecode for #class is compiled. And even
after that, you have to recompile all senders of #class to make it use
the primitive and the new method instead of optimizing it away.
Right. Assuming we can achieve consensus with Eliot, and the next
Squeak will have a new VM, then that would be called from an MC post
script.

But what do you mean make all senders of #class use the primitive?
Just as you suggested the use of #ensureNonProxiedReceiver from the
other thread, the intention here is that #basicClass would better
document those performance-critical places, but leaving the majority
(of non-critical ones) sending #class, so it can be overridable.

Do you think the system would be noticeably slower if all the sends to
#class became a message send? I'm skeptical that it would, but I have
no idea. I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most rarely ever called.

Removing those byteCodes from my CompiledMethods is above my knowledge
level, but if you could help me come up with a script, I'd be
interested in testing and playing around to learn more.
Post by Levente Uzonyi
Post by c***@source.squeak.org
- If so, then #xxxClass can be banished.
- With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".
That won't work either for the same reason. And we do not want to remove
the bytecode, do we?
Not remove it, redirect it to #basicClass.

This is a reasonable and familiar pattern, right? It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name. No magic.

-
Levente Uzonyi
2018-11-24 23:43:01 UTC
Post by Chris Muller
Post by Levente Uzonyi
Post by c***@source.squeak.org
- Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
It won't work while the special bytecode for #class is compiled. And even
after that, you have to recompile all senders of #class to make it use
the primitive and the new method instead of optimizing it away.
Right. Assuming we can achieve consensus with Eliot, and the next
Squeak will have a new VM, then that would be called from an MC post
script.
I don't see what kind of VM changes are necessary here. Care to elaborate?
Post by Chris Muller
But what do you mean make all senders of #class use the primitive?
Currently, when you compile a method containing a send of #class, the
compiler will generate a special bytecode for it (199).
When the interpreter/jit sees this bytecode, it will not perform a send
nor a primitive; it'll just look up the class of the receiver and place it
on top of the stack.
You can see this in action by removing the sole implementor of #class from
your image without any effects. That method is only there for
consistency, it is never executed.

So, while the bytecode is in use, it doesn't matter what you do with the
#class method, because it will never be sent.
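One way to see this for yourself is to look at the compiled form of a method that sends #class. The following is a minimal workspace sketch, assuming an image that uses the V3PlusClosures bytecode set (where byte 199 is 16rC7) and that Object>>storeDataOn: still contains "self class" as in the diff above; the exact listing format may differ between images.

```smalltalk
"Workspace sketch: inspect how a send of #class was compiled.
Assumption: V3PlusClosures bytecode set, where 199 = 16rC7."
| method |
method := Object >> #storeDataOn:.

"Print the symbolic bytecode listing to the Transcript. The sends of
#class should show up with byte C7 (the special bytecode) rather than
as ordinary sends that reference a literal selector."
Transcript cr; show: method symbolic.
```

If the listing shows an ordinary literal send for #class instead, the image has probably already been recompiled without the special bytecode.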
Post by Chris Muller
Just as you suggested the use of #ensureNonProxiedReceiver from the
other thread, the intention here is that #basicClass would better
document those performance-critical places, but leaving the majority
(of non-critical ones) sending #class, so it can be overridable.
See above.
Post by Chris Muller
Do you think the system would be noticeably slower if all the sends to
#class became a message send? I'm skeptical that it would, but I have
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break. This pretty much applies
to all special selectors (See SmalltalkImage >> #specialSelectors).
Post by Chris Muller
no idea. I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Post by Chris Muller
Removing those byteCodes from my CompiledMethods is above my knowledge
level, but if you could help me come up with a script, I'd be
interested in testing and playing around to learn more.
VariableNode has a class variable named StdSelectors. It contains the
selectors for which custom bytecodes are generated. Removing #class from
there should be enough.
Post by Chris Muller
Post by Levente Uzonyi
Post by c***@source.squeak.org
- If so, then #xxxClass can be banished.
- With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".
That won't work either for the same reason. And we do not want to remove
the bytecode, do we?
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.

Levente
Chris Muller
2018-11-25 07:13:20 UTC
Hi Levente,
Post by Levente Uzonyi
Post by Chris Muller
But what do you mean make all senders of #class use the primitive?
Currently, when you compile a method containing a send of #class, the
compiler will generate a special bytecode for it (199).
When the interpreter/jit sees this bytecode, it will not perform a send
nor a primitive; it'll just look up the class of the receiver and place it
on top of the stack.
Great! Does that mean this can be accomplished solely in the image by
making the compiler generate 199 when #basicClass is sent, and just
the normal "send" bytecode for sends to #class?
Post by Levente Uzonyi
Post by Chris Muller
Do you think the system would be noticeably slower if all the sends to
#class became a message send? I'm skeptical that it would, but I have
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
It saves one send. One. That's only infinitesimally quicker:
_________
{ [1 xxxClass] bench.
[ 1 class ] bench. }

----> #('99,000,000 per second. 10.1 nanoseconds per run.'
'126,000,000 per second. 7.93 nanoseconds per run.')
________

2 nanoseconds per send faster. Inconsequential in any real-world
sense. Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
wiped out. Look how much of the gain is lost doing as little as
creating one single Rectangle from another one:

___________
"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [someRectangle xxxClass origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.
[someRectangle class origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench. }

---> #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________

Real-world gain by the inlined send was reduced to... whew! I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.

So, amplify. Crank it up to 100K:
__________
"Compare creating 100,000 Rectangles with inlined #class vs. a
message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
[ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench. }

---> #('364 per second. 2.75 milliseconds per run.' '369 per
second. 2.71 milliseconds per run.')
_________

Nothing times 100K is still nothing.
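The ratios and differences implied by the three results above can be worked out directly. This is a minimal workspace sketch that uses only the figures quoted in this message, converted to integers (runs per second, and picoseconds) so that the arithmetic is exact; nothing here is a new measurement.

```smalltalk
"Workspace sketch: speedups and per-call differences implied by the
three quoted results. Evaluate with 'print it' to see the array."
{
	"1) bare send, runs per second, #class vs. #xxxClass: roughly 1.27"
	(126000000 / 99000000) asFloat.

	"2) one Rectangle creation, runs per second: roughly 1.02"
	(38000000 / 37200000) asFloat.

	"3) 100,000 creations per run, runs per second: roughly 1.01"
	(369 / 364) asFloat.

	"Benchmark 2, difference per call in picoseconds
	(26.9 ns vs. 26.3 ns): 600"
	26900 - 26300.

	"Benchmark 3, difference per iteration in picoseconds
	(2.75 ms vs. 2.71 ms spread over 100,000 iterations): 400"
	(2750000000 - 2710000000) / 100000
}
```

So the 27% figure from the first benchmark shrinks to about 2% once a Rectangle is created, which is where the sub-nanosecond ("picosecond") remark comes from.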
Post by Levente Uzonyi
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break.
If THAT exists it needs a more intention-revealing selector than
#class that would let his peers know atomicity mattered there.
#basicClass is his friend.
Post by Levente Uzonyi
Post by Chris Muller
... I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Sorry, I should not have said "ever". I was trying to say the system
probably spends more of its time sending to instance-side methods than
to class-side methods.
Post by Levente Uzonyi
Post by Chris Muller
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.
I'm racking my brain trying to understand this -- sorry... By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class. Then, recompile all methods. Would that work?
Post by Levente Uzonyi
Post by Chris Muller
This is a reasonable and familiar pattern, right? It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name. No magic.
So, if
performance is not really hurt, and
we can keep sending #class if so insisted, and
we still have #basicClass, just in case, together
delineating an elegant seam between system-level vs. user-level access
in a classic Smalltalky way that even *I* can understand and use,
and give Squeak better Proxy support that helps Magma
then
would you let me have this?

You have a skill for taking performance considerations to degrees
that I never would have fathomed, and this has resulted in
immense performance benefits for Squeak. I do wish you liked Magma,
because I'm sure you could _obliterate_ many inefficiencies in the
code and design. But if not, I hope you can at least appreciate the
value proposition of this prop
Levente Uzonyi
2018-11-25 18:09:05 UTC
Hi Chris,
Post by Chris Muller
Hi Levente,
Post by Levente Uzonyi
Post by Chris Muller
But what do you mean make all senders of #class use the primitive?
Currently, when you compile a method containing a send of #class, the
compiler will generate a special bytecode for it (199).
When the interpreter/jit sees this bytecode, it will not perform a send
nor a primitive; it'll just look up the class of the receiver and place it
on top of the stack.
Great! Does that mean this can be accomplished solely in the image by
making the compiler generate 199 when #basicClass is sent, and just
the normal "send" bytecode for sends to #class?
Post by Levente Uzonyi
Post by Chris Muller
Do you think the system would be noticeably slower if all the sends to
#class became a message send? I'm skeptical that it would, but I have
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
_________
{ [1 xxxClass] bench.
[ 1 class ] bench. }
----> #('99,000,000 per second. 10.1 nanoseconds per run.'
'126,000,000 per second. 7.93 nanoseconds per run.')
________
2 nanoseconds per send faster. Inconsequential in any real-world
sense. Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
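wiped out. Look how much of the gain is lost doing as little as
creating one single Rectangle from another one:
___________
"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [someRectangle xxxClass origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.
[someRectangle class origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench. }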
---> #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________
Real-world gain by the inlined send was reduced to... whew! I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.
__________
"Compare creating 100,000 Rectangles with inlined #class vs. a
message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
[ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench. }
---> #('364 per second. 2.75 milliseconds per run.' '369 per
second. 2.71 milliseconds per run.')
_________
Nothing times 100K is still nothing.
That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block. Also, #timesRepeat: is not a good choice for
measurements for the very same reason: block creation + lots of block
activation.
Also, the nearby bytecodes affect what the JIT does. When more things can
be executed without performing a send, the overall performance gains
will be higher.
Post by Chris Muller
Post by Levente Uzonyi
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break.
If THAT exists it needs a more intention-revealing selector than
#class that would let his peers know atomicity mattered there.
#basicClass is his friend.
All special selectors do the same e.g. #==, #ifNil:, #ifTrue:. Do you
think all of those need #basicXXX methods?
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
... I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Sorry, I should not have said "ever". I was trying to say the system
probably spends most of its time sending to instance-side methods than
class-side methods.
It's a common pattern to have instance-independent code on the class side.
Quick access to that is always a good thing.
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.
I'm racking my brain trying to understand this -- sorry... By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class. Then, recompile all methods. Would that work?
It might work, but you would need to identify and rewrite senders of
#class which rely on the presence of the bytecode. In my image there are
2174 senders, which is simply too much to review in my opinion.
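The sender count will vary from image to image. A minimal workspace sketch for getting the figure in your own image, assuming the standard SystemNavigation API:

```smalltalk
"Workspace sketch: count the methods in this image that send #class.
The result depends on which packages are loaded."
| senders |
senders := SystemNavigation default allCallsOn: #class.
Transcript
	cr;
	show: senders size printString;
	show: ' senders of #class'.
```

Inspecting the senders collection instead of printing its size gives the list of methods that would need review.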

I did some measurements and found that the JIT makes the numbered
primitive almost as quick as the bytecode. The slowdown is only about 10%.
Your suggestion, which is send + bytecode, is about 85% slower and loses
the atomicity of the message. So, you'd better leave the implementation of
#class as it is right now, because that would be quicker and would
preserve the atomicity as long as nothing overrides it.
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
This is a reasonable and familiar pattern, right? It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name. No magic.
So, if
performance is not really hurt, and
we can keep sending #class if so insisted, and
we still have #basicClass, just in case, together
delineating an elegant seam between system-level vs. user-level access
in a classic Smalltalky way that even *I* can understand and use,
and give Squeak better Proxy support that helps Magma
then
would you let me have this?
As I wrote a few emails earlier, I'd rather have a "switch" for this
than force it on everyone who doesn't use proxies at all (I presume that's
the current majority of Squeak users).

Levente
Chris Muller
2018-11-25 21:37:34 UTC
Hi Levente,
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
Do you think the system would be noticeably slower if all the sends to
#class became a message send? ...
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
So even though you answered a different question, I was still curious
about your claim, and remembered that you're one who likes to communicate
with benchmarks. That's why I ran and presented them to you, but I'm
not sure if we're interpreting the results relative to my question or
some other question...
Post by Levente Uzonyi
Post by Chris Muller
_________
{ [1 xxxClass] bench.
[ 1 class ] bench. }
----> #('99,000,000 per second. 10.1 nanoseconds per run.'
'126,000,000 per second. 7.93 nanoseconds per run.')
________
2 nanoseconds per send faster. Inconsequential in any real-world
sense. Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
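wiped out. Look how much of the gain is lost doing as little as
creating one single Rectangle from another one:
___________
"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [someRectangle xxxClass origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.
[someRectangle class origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench. }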
---> #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________
Real-world gain by the inlined send was reduced to... whew! I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.
__________
"Compare creating 100,000 Rectangles with inlined #class vs. a
message-send of #class."
| someRectangle | someRectangle := ***@50 corner: ***@200.
{ [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
[ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench. }
---> #('364 per second. 2.75 milliseconds per run.' '369 per
second. 2.71 milliseconds per run.')
_________
Nothing times 100K is still nothing.
That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block. Also, #timesRepeat: is not a good choice for
measurements for the very same reason: block creation + lots of block
activation.
Also, the nearby bytecodes affect what the JIT does. When more things can
be executed without performing a send, the overall performance gains
will be higher.
There are three benchmarks, did you notice the first two?

- The first one measures the single-unit cost of #xxxClass over
#class. This captures your theoretical maximum benefit of 27%, which
is terrible, because it can't come close to that in real code.

- The second demonstrates how 90% of that 27% benefit is wiped out
with no more than a single simple allocation -- what the vast majority
of class methods are responsible for.

- The third one measures "real world impact", and shows that this
particular in-line doesn't help the system in any way that helps any
human anywhere.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break.
If THAT exists it needs a more intention-revealing selector than
#class that would let his peers know atomicity mattered there.
#basicClass is his friend.
All special selectors do the same e.g. #==, #ifNil:, #ifTrue:. Do you
think all of those need #basicXXX methods?
No, just #class. An identity-check should be an identity-check, even
against a Proxy. And does that example help illustrate how using #==
when you DON'T need an identity-check is a breakage of encapsulation?
It makes false assumptions and enforces type-conformance in a system
that wants to be empowered by messaging.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
... I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Sorry, I should not have said "ever". I was trying to say the system
probably spends most of its time sending to instance-side methods than
class-side methods.
It's a common pattern to have instance-independent code on the class side.
Quick access to that is always a good thing.
It's still quick! Levente, I challenge you to back up your claim by
identifying any one single method in the image which reports even only
a meaningfully better *bench* performance (much less real-world) by
calling it via #class instead of #xxxClass.

Anything whose performance matters at the level of one send is going to
use #basicClass anyway, just as we have a few places that send
#basicNew instead of #new.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.
I'm racking my brain trying to understand this -- sorry... By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class. Then, recompile all methods. Would that work?
It might work, but you would need to identify and rewrite senders of
#class which rely on the presence of the bytecode. In my image there are
2174 senders, which is simply too much review in my opinion.
I repeat my challenge above!
Post by Levente Uzonyi
I did some measurements and found that the JIT makes the numbered
primitive almost as quick as the bytecode. The slowdown is only about 10%.
Your suggestion, which is send + bytecode is about 85% slower and loses
the atomicity of the message. So, you'd better leave the implementation of
#class as it is right now, because that would be quicker and would
preserve the atomicity as long as nothing overrides it.
Huh? No, you're only 27% faster in the *benchmark*, but near zero in
anything real-world.

My challenge above, stands. I would love to be wrong, so I could shed
my suspicion of whether this is about something else not mentioned...
:(
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
This is a reasonable and familiar pattern, right? It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name. No magic.
So, if
performance is not really hurt, and
we can keep sending #class if so insisted, and
we still have #basicClass, just in case, together
delineating an elegant seam between system-level vs. user-level access
in a classic Smalltalky way that even *I* can understand and use,
and give Squeak better Proxy support that helps Magma
then
would you let me have this?
As I wrote it a few emails earlier, I'd rather have a "switch" for this
than forcing it on everyone who don't use proxies at all (I presume that's
the current majority of Squeak users).
Whoa, hold on there. You only ever made one argument -- "performance"
-- which was obliterated by the benchmarks. Squeezing 27% more out of
a microbench of something called 0.0001% of the time results in no
benefit to anyone anywhere.

I see MY position as the pro-user position, and yours as the... pro
fastest-lab-result position, which hurts this Squeak user. I'm sad that
that alone isn't enough to support this. :(
_______
Do you remember when Behavior>>#new didn't always make a call to
#initialize? But at a time when Squeak was 10X slower than it is now,
the people then had the wisdom to understand that the computer and
software exists to eventually serve _users_, and that spiting users to
save one single send, even when it was a much greater percentage of
impact back then, was stil
Levente Uzonyi
2018-11-25 23:56:40 UTC
Hi Chris,

This conversation is getting off the track, so let's take a step back and
try something different.
I had suggested a solution to you: the "switch", but you never mentioned how
it worked for you. Perhaps my explanation wasn't clear.
Let me just give you a snippet which does exactly what I suggested.
Please try it in your image (one without Kernel-cmm.1198 loaded) and let
me know if it solved your problem or not:

(ParseNode classPool at: #StdSelectors) removeKey: #class.
Compiler recompileAll.
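Since the snippet above changes how every method in the image is compiled, it may help to be able to switch back. The following is a sketch only: the global name SavedClassSelectorNode is a hypothetical name chosen for illustration, and it assumes StdSelectors is a Dictionary keyed by selector, as the snippet above implies. Evaluate the two parts as separate do-its, the first one before removing the key.

```smalltalk
"Part 1, BEFORE the switch: remember the compiler's entry for #class.
SavedClassSelectorNode is a hypothetical global used only for this sketch."
| stdSelectors |
stdSelectors := ParseNode classPool at: #StdSelectors.
Smalltalk
	at: #SavedClassSelectorNode
	put: (stdSelectors at: #class ifAbsent: [nil]).

"Part 2, LATER, to switch back: restore the entry and recompile so
that sends of #class use the special bytecode again."
(Smalltalk at: #SavedClassSelectorNode ifAbsent: [nil]) ifNotNil: [:node |
	(ParseNode classPool at: #StdSelectors) at: #class put: node.
	Compiler recompileAll].
```

Trying this in a throwaway copy of the image is the safer option either way.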

Levente

P.S.: Here's the benchmark I used to get my numbers:

runs := (1 to: 5) collect: [ :e |
    {
        [ 1 to: 50000000 do: [ :i |
            i class class class class class
              class class class class class ] ] timeToRun.
        [ 1 to: 50000000 do: [ :i |
            i classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive
              classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive ] ] timeToRun.
        [ 1 to: 50000000 do: [ :i |
            i classSend classSend classSend classSend classSend
              classSend classSend classSend classSend classSend ] ] timeToRun.
        [ 1 to: 50000000 do: [ :i | i ] ] timeToRun } ].
cleanRuns := runs collect: [ :e | (e - e last) allButLast ].
primitiveVsByteCode := (cleanRuns collect: [ :e | e second / e first ])
    average printShowingMaxDecimalPlaces: 2.
sendVsByteCode := (cleanRuns collect: [ :e | e third / e first ])
    average printShowingMaxDecimalPlaces: 2.

Where Object >> #classPrimitive is

classPrimitive
    "Primitive. Answer the object which is the receiver's class.
    Essential. See Object documentation whatIsAPrimitive."

    <primitive: 111>
    self primitiveFailed

And Object >> #classSend is

classSend
    "Answer the receiver's class through an ordinary method whose body
    sends #class (compiled to the special bytecode), i.e. send + bytecode."

    ^self class
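To reproduce the numbers, the two helper methods have to exist in the image before the benchmark runs. A minimal sketch that installs them from a workspace and removes them afterwards, assuming #class is still in StdSelectors when classSend is compiled (otherwise classSend measures two real sends rather than send + bytecode):

```smalltalk
"Step 1: install the two benchmark helpers on Object."
Object
	compile: 'classPrimitive
	<primitive: 111>
	self primitiveFailed'
	classified: 'class membership'.
Object
	compile: 'classSend
	^self class'
	classified: 'class membership'.

"Step 2: run the benchmark above as a separate do-it."

"Step 3: afterwards, evaluate these two lines on their own to clean up."
Object removeSelector: #classPrimitive.
Object removeSelector: #classSend.
```

Evaluating the whole block in one go would install and immediately remove the helpers, so select each step separately.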
Post by Chris Muller
Hi Levente,
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
Do you think the system would be noticeably slower if all the sends to
#class became a message send? ...
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
So even though you answered a different question, I was still curious
about your claim, and remembered that you're one who likes to communicate
with benchmarks. That's why I ran and presented them to you, but I'm
not sure if we're interpreting the results relative to my question or
some other question...
Post by Levente Uzonyi
Post by Chris Muller
_________
{ [1 xxxClass] bench.
[ 1 class ] bench. }
----> #('99,000,000 per second. 10.1 nanoseconds per run.'
'126,000,000 per second. 7.93 nanoseconds per run.')
________
2 nanoseconds per send faster. Inconsequential in any real-world
sense. Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
wiped out. Look how much of the gain is lost doing as little as
___________
"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
someRectangle bottomRight ] bench.
someRectangle bottomRight ] bench. }
---> #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________
Real-world gain by the inlined send was reduced to... whew! I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.
__________
"Compare creating 100,000 Rectangles with inlined #class vs. a
message-send of #class."
{ [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
[ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench. }
---> #('364 per second. 2.75 milliseconds per run.' '369 per
second. 2.71 milliseconds per run.')
_________
Nothing times 100K is still nothing.
That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block. Also, #timesRepeat: is not a good choice for
measurements for the very same reason: block creation + lots of block
activation.
Also, the nearby bytecodes affect what the JIT does. When more things can
be executed without performing a send, the overall performance gains
will be higher.
There are three benchmarks, did you notice the first two?
- The first one measures the single-unit cost of #xxxClass over
#class. This captures your theoretical maximum benefit of 27%, which
is terrible, because it can't come close to that in real code.
- The second demonstrates how 90% of that 27% benefit is wiped out
with no more than a single simple allocation -- what the vast majority
of class methods are responsible for.
- The third one measures "real world impact", and shows that this
particular in-line doesn't help the system in any way that helps any
human anywhere.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break.
If THAT exists it needs a more intention-revealing selector than
#class that would let his peers know atomicity mattered there.
#basicClass is his friend.
All special selectors do the same e.g. #==, #ifNil:, #ifTrue:. Do you
think all of those need #basicXXX methods?
No just #class. An identity-check should be an identity-check, even
against a Proxy. And does that example help illustrate how using #==
when you DON'T need an identity-check is a breakage of encapsulation?
It makes false assumptions and enforces type-conformance in a system
that wants to be empowered by messaging.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
... I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most are rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Sorry, I should not have said "ever". I was trying to say the system
probably spends more of its time sending to instance-side methods than
to class-side methods.
It's a common pattern to have instance-independent code on the class side.
Quick access to that is always a good thing.
It's still quick! Levente, I challenge you to back up your claim by
identifying any one single method in the image which reports even only
a meaningfully better *bench* performance (much less real-world) by
calling it via #class instead of #xxxClass.
Anything whose performance matters at a level of one send is going to
use #basicClass anyway, just like we may have a few that we send
#basicNew instead of #new to.
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.
I'm racking my brain trying to understand this -- sorry... By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class. Then, recompile all methods. Would that work?
It might work, but you would need to identify and rewrite senders of
#class which rely on the presence of the bytecode. In my image there are
2174 senders, which is simply too much review in my opinion.
I repeat my challenge above!
Post by Levente Uzonyi
I did some measurements and found that the JIT makes the numbered
primitive almost as quick as the bytecode. The slowdown is only about 10%.
Your suggestion, which is send + bytecode is about 85% slower and loses
the atomicity of the message. So, you'd better leave the implementation of
#class as it is right now, because that would be quicker and would
preserve the atomicity as long as nothing overrides it.
Huh? No, you're only 27% faster in the *benchmark*, but near zero in
anything real-world.
My challenge above, stands. I would love to be wrong, so I could shed
my suspicion of whether this is about something else not mentioned...
:(
Post by Levente Uzonyi
Post by Chris Muller
Post by Levente Uzonyi
Post by Chris Muller
This is a reasonable and familiar pattern, right? It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name. No magic.
So, if
performance is not really hurt, and
we can keep sending #class if so insisted, and
we still have #basicClass, just in case, together
delineating an elegant seam between system-level vs. user-level access
in a classic Smalltalky way that even *I* can understand and use,
and give Squeak better Proxy support that helps Magma
then
would you let me have this?
As I wrote it a few emails earlier, I'd rather have a "switch" for this
than forcing it on everyone who doesn't use proxies at all (I presume that's
the current majority of Squeak users).
Whoa, hold on there. You only ever made one argument -- "performance"
-- which was obliterated by the benchmarks. Squeezing 27% more out of
a microbench of something called 0.0001% of the time results in no
benefit to anyone anywhere.
I see MY position as the pro-user position, and yours as the... pro
fastest-lab-result position, which hurts this Squeak user. I'm sad that
that alone isn't enough to support this. :(
_______
Do you remember when Behavior>>#new didn't always make a call to
#initialize? But at a time when Squeak was 10X slower than it is now,
the people then had the wisdom to understand that the computer and
software exists to eventually serve _users_, and that spiting users to
save one single send, even when it was a much greater percentage of
impact back then, was still way wo
Chris Muller
2018-11-25 22:54:27 UTC
Permalink
Post by Levente Uzonyi
That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block.
I get you, but that it's so hard to even write such a test indicates
that real-world code also needs to do a lot of block-activations, and
so this quickly dilutes the density of calls to #class.

The only way I could think was to just cut-and-paste the block innards
X 100 times and measure the degradation from the baseline (single):

{ [1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. ] bench.

[ 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. ] bench. }

#('2,780,000 per second. 360 nanoseconds per run.' '5,590,000 per
second. 179 nanoseconds per run.')

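So 100X more density of calls to #xxxClass degraded its performance
from 27% slower per run (10.1 vs. 7.93 nanoseconds) to about twice as
slow per run (360 vs. 179 nanoseconds), i.e. 50% of the throughput.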
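
As a rough check of the figures above, the per-run times can be turned
into a ratio and an approximate per-send cost. This is only a workspace
sketch: the variable names are made up for illustration, and the
division by 100 assumes each block really holds 100 sends, as described.

```smalltalk
| xxxClassNs classNs sendsPerRun |
"Nanoseconds per run, copied from the #bench results above."
xxxClassNs := 360.
classNs := 179.
"Assumption: each block holds 100 sends, as described in the text."
sendsPerRun := 100.
{ "How many times longer the #xxxClass block takes per run."
  (xxxClassNs / classNs) printShowingMaxDecimalPlaces: 2.
  "Approximate extra nanoseconds per individual send."
  ((xxxClassNs - classNs) / sendsPerRun) printShowingMaxDecimalPlaces: 2 }
```

The ratio works out to about 2.01 and the difference to about 1.81
nanoseconds per send, which is close to the roughly 2 nanoseconds seen
in the single-send benchmark (10.1 vs. 7.93).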

So the real question is how dense are the calls to #class, and are
they mostly from only a few senders which could retain the
optimization by #basicClass? It would be an interesting experiment.
Pointless, though, if there's no chance of
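
For the experiment mentioned above, one static starting point might
look like the sketch below. It assumes SystemNavigation>>allCallsOn:,
MethodReference>>actualClass and Bag>>sortedCounts behave as in a recent
Squeak trunk image. It only counts the methods that send #class, grouped
by the class that defines them; it says nothing about how often those
methods actually run, so a profiler would still be needed to answer the
density question.

```smalltalk
| senders perClass |
"All methods in the image that send #class (static count only)."
senders := SystemNavigation default allCallsOn: #class.
"Tally the sending methods by the class that defines them."
perClass := Bag new.
senders do: [ :ref | perClass add: ref actualClass ].
"Total number of senders, then the ten classes with the most of them."
{ senders size.
  perClass sortedCounts first: 10 }
```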