Yes, sharing data between GPU kernels without transferring it back to the host CPU is the ideal situation.
Given a kernel such as __global__ void Addition (float *aD, float *bD, float *cD), where aD, bD, and cD are pointers to device memory and cD [index] = aD [index] + bD [index], you can use cD in the next kernel without copying it back to the CPU.
A second kernel such as __global__ void ScalarMultiply (float *arrD, int scalar) could then consume the output of Addition directly, with no copy to the CPU. The host code that launches the kernels might look something like:
// Do cudaMalloc allocations and cudaMemcpys here
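// cudaMemcpy() or cudaDeviceSynchronize() on the host makes the host wait for the kernels to complete
// These 2 kernel launches return to the host immediately, but because they are issued to the same
// (default) stream they execute in order on the GPU, so no explicit synchronize call is needed between them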
Addition <<< gridSize, blockSize >>> (num1D, num2D, sumD);
// This call can re-use the sumD memory that is already on the GPU
ScalarMultiply <<<gridSize, blockSize>>> (sumD, 5);
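
To see the whole thing end to end, here is a minimal sketch of a complete program built around the two kernels above. The kernel bodies, the extra length parameter n, the bounds checks, the array size, and the host-side names (num1H, num2H, sumH) are my own assumptions added for illustration; only the kernel names and the launch pattern come from the snippet above. Error checking is kept to a single cudaGetLastError() call for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Element-wise sum into device memory. The length n is an added (assumed) parameter.
__global__ void Addition(const float *aD, const float *bD, float *cD, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) cD[index] = aD[index] + bD[index];
}

// Scales the array in place; it operates on whatever device pointer it is given.
__global__ void ScalarMultiply(float *arrD, int scalar, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) arrD[index] *= scalar;
}

int main()
{
    const int n = 1024;                      // illustrative size
    const size_t bytes = n * sizeof(float);

    // Host buffers (names are hypothetical)
    float *num1H = (float *)malloc(bytes);
    float *num2H = (float *)malloc(bytes);
    float *sumH  = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { num1H[i] = (float)i; num2H[i] = 2.0f * i; }

    // Device buffers
    float *num1D, *num2D, *sumD;
    cudaMalloc((void **)&num1D, bytes);
    cudaMalloc((void **)&num2D, bytes);
    cudaMalloc((void **)&sumD,  bytes);

    // Inputs go to the GPU once
    cudaMemcpy(num1D, num1H, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(num2D, num2H, bytes, cudaMemcpyHostToDevice);

    const int blockSize = 256;
    const int gridSize  = (n + blockSize - 1) / blockSize;

    // Both launches go to the default stream, so they run in order on the GPU.
    // sumD never leaves device memory between them.
    Addition<<<gridSize, blockSize>>>(num1D, num2D, sumD, n);
    ScalarMultiply<<<gridSize, blockSize>>>(sumD, 5, n);

    // Report launch errors, if any
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));

    // This blocking copy waits for both kernels, then brings back the final result only
    cudaMemcpy(sumH, sumD, bytes, cudaMemcpyDeviceToHost);

    printf("sumH[1] = %f\n", sumH[1]);       // (1 + 2) * 5 = 15

    cudaFree(num1D); cudaFree(num2D); cudaFree(sumD);
    free(num1H); free(num2H); free(sumH);
    return 0;
}
```

The only device-to-host copy is the last one, after both kernels have run. If the two kernels were launched into different streams, the in-order guarantee would no longer hold and you would need an event or an explicit synchronize between them.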