Thanks BC for that link. I read through it, and one of the external links sent me to a Tutorials Point YouTube video with more information on this.
For some reason I was thinking that cache could be targeted through an address range that was labelled and associated as cache, similar to how, knowing a specific address, you can perform memory reads and writes to fetch or store information at that address; if an address can be targeted then it's known, and so you can pass data to and from it. My analogy was a system with a hard drive and an SSD: you can put information on the SSD for faster read/write, while data that isn't speed-critical stays on the hard drive. I was thinking that maybe there was a function that could target cache directly, but it's hands-off and handled by the CPU only. So it seems that you can address system RAM directly, but not cache. Cache is hands-off, and it's a matter of keeping a program small so that the CPU can hold it in its internal cache.
The program I am running is 553KB in size, so I guess it's pretty probable that the L3 is being utilized for it: with 6MB of shared L3 cache available across all 4 cores, running 4 instances of the single-threaded program would consume roughly 2.2MB (553KB x 4) of the 6MB. On the other processors that only have L1 and L2 cache, it was likely less efficient and hitting the system RAM more, because with 512KB per core of the 2MB of L2 cache, a single 553KB instance doesn't even fit within one core's L2 in its entirety, let alone 4 instances.
It looks like, to make a program that makes better use of your cache memory, it's a matter of keeping the program size minimal; information that is called over and over again, the CPU will pick up on and place into its cache across the 3 cache levels on processors with L1, L2, and L3 cache, such as the AMD Phenom and FX processors that I have.
Been working on making a program more efficient, and was thinking that if I could get it into the L3 cache it would run much more efficiently with fewer wasted clock cycles. I've even thought about porting it to Linux and running it from a distro that doesn't have a GUI, because Windows itself has processing overhead, which is waste. Also did some reading into methods of utilizing GPUs for mathematical crunching, but haven't found any examples that show an easy way to tap into a GPU to process a program vs the CPU.
I have two GTX 570 video cards and a GTX 780 Ti. Knowing that GPUs are better at crunching numbers (the cryptocurrency mining crowd was using them for a while, which points out this greater efficiency), they might be the better way to go, but I have yet to find any examples that show how to load a program onto a GPU and have it crunch away. From what I've read, NVIDIA's CUDA toolkit (or OpenCL) is the usual route for running your own number-crunching code on cards like these.
The project I am working on is sort of looking for needles in a haystack. It's all out of curiosity: I have a program that shuffles 89 characters randomly using seeded random, where the seed is a key used to scramble and unscramble information. Since a "perfect" shuffle is possible (as with a deck of cards, where a shuffle could land the deck back in the order in which it was purchased, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A for each suit), I have been interested, with a long long int used as a seed, in the frequency at which weak keys occur, as well as in hunting down the worst keys, the ones that shuffle back to close to or exactly the original order. There isn't enough processing power in the world to run every combination of 89 characters (the permutations of 89) through a shuffle. Here is an interesting link on the permutations of 52 (a deck of cards):
https://www.quora.com/How-many-combinations-can-a-deck-of-52-cards-make

In the program that I use, the best method to avoid a weak key is just to test the key for strength before it's used, which takes a fraction of a second. Currently, for the weak-key search, I put in a starting value and an ending value to run to, and give it a flag value (threshold): if that many or more characters match between the starting position in array 1 and the destination position in array 2 for an iteration, it writes that key value to a file so I can check it out further and see how bad it is. Looking for 15 or more characters out of 89 that shuffle back to their original order, I have run through 10 billion keys so far with no hits. If I drop the threshold to 8, they start popping up here and there, so I know the program is doing what it's designed to do rather than failing to report due to a flaw in the code. It's just that 15 or more characters shuffling back to their original positions is extremely rare, going by the first 10 billion tested with nothing found yet.
Performance of the Phenom II x4 945 3.0GHz CPU is pretty good. I am able to test 1 billion keys per 2 hours and 17 minutes, with 250 million keys per core across 4 instances of the single-threaded program, with core affinity set to tag each instance to its own core.
The project is mainly a curiosity and not an insanity. And it acts as a small space heater during the cooler months of the year, in which it's currently 30F / -1C outside.