MESSAGE
DATE | 2016-11-01
FROM | Ruben Safir
SUBJECT | Re: [Learn] how is it indexing in cuda
From: Ruben Safir
To: Samir Iabbassen, learn-at-nylxs.com
Date: Tue, 1 Nov 2016 23:08:33 -0400
Subject: Re: [Learn] how is it indexing in cuda
On 11/01/2016 09:48 PM, Samir Iabbassen wrote:
> Take a look at this document if you have time. I can explain why and how we use indexing in CUDA tomorrow.
> http://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
>
> You may take a look at other related lectures for CUDA programming on http://users.wfu.edu/choss/CUDA/docs/
I've been looking at the CUDA docs, but they added to the confusion rather than relieving it. The first two arguments in a kernel instantiation,
<<< B1, T1 >>>, are dim3's (with x, y and z components), but the explanation of how to access them is not comprehensible, at least to me.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration
B.19. Execution Configuration
Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream (see CUDA C Runtime for a description of streams).
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
- Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;
- Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
- Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;
- S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
Yikes!
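As far as I can follow it, a launch that spells out all four of those arguments would look something like the sketch below. This is my own guess at a minimal example; the kernel name, N, and the stream are made up for illustration, and only the <<< Dg, Db, Ns, S >>> shape comes from the docs.

#include <cuda_runtime.h>

// My own sketch of a launch using all four execution-configuration arguments.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 Db(256);                        // Db.x * Db.y * Db.z = 256 threads per block
    dim3 Dg((N + Db.x - 1) / Db.x);      // enough blocks to cover all N elements
    size_t Ns = 0;                       // no dynamic shared memory for this call

    scale<<<Dg, Db, Ns, stream>>>(d_data, N);   // <<< Dg, Db, Ns, S >>>

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}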
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels
2.1. Kernels
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax (see C Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
Here, each of the N threads that execute VecAdd() performs one pair-wise addition.

2.2. Thread Hierarchy
For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y*Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y*Dx + z*Dx*Dy).
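(Writing that formula out as code, as far as I can tell, it amounts to the little kernel below. The kernel and the names in it are mine, not from the docs.)

// My paraphrase of the thread-ID formula above: linearize the 3-D thread
// index of a block of size (Dx, Dy, Dz) = (blockDim.x, blockDim.y, blockDim.z).
__global__ void whoAmI(int *ids)
{
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    ids[threadId] = threadId;   // one slot per thread in the block
}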
As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 6. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.

[Figure 6: Grid of Thread Blocks]
The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 5, enabling programmers to write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory.
For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
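(The closest I can get to a concrete picture of that shared memory plus __syncthreads() cooperation is the sketch below: a block-wide sum using dynamically allocated shared memory, which is also where the Ns argument from B.19 comes in. This is my own guess at an example, not something from the docs or the video.)

// My own sketch: each block sums its slice of the input into one value,
// staging data in dynamically allocated shared memory with __syncthreads() barriers.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float s[];              // sized by Ns at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;          // one element per thread
    __syncthreads();                          // wait until every thread has written

    // Tree reduction within the block; blockDim.x assumed to be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                      // barrier between reduction steps
    }

    if (tid == 0)
        blockResults[blockIdx.x] = s[0];      // one partial sum per block
}

// A launch would pass the dynamic shared-memory size as the third argument, e.g.:
//   blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);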
~~~~~~~~~~~~~~~~~~~~~~

Great, so I read ALL of this, being the studious programmer that I am, and the video throws me completely for a loop. It says to map a 2D object:
//First create a mapping from the 2D block and grid locations
//to an absolute 2D location in the image, then use that to
//calculate a 1D offset
I have no background for this. Nor have the PowerPoint slides given me a better idea of what they might mean, because the example in the shared memory discussion maps shared memory arrays to the original image (see slide 54).
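For what it's worth, here is my best guess at what that comment is after, assuming a row-major image laid out as numRows rows of numCols pixels. All the names here are mine, not the video's, so take it as a sketch rather than their answer.

// My guess at the 2D -> 1D mapping the video's comment describes.
// Each thread handles one pixel of a numCols x numRows, row-major image.
__global__ void perPixel(const unsigned char *in, unsigned char *out,
                         int numCols, int numRows)
{
    // Absolute 2D position in the image, built from block and thread indices.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col >= numCols || row >= numRows)     // guard threads that fall off the edge
        return;

    // 1D offset into the flat buffer for a row-major layout.
    int offset = row * numCols + col;
    out[offset] = 255 - in[offset];           // some per-pixel operation
}

If that is right, it also answers the slide 39 question I quoted below: the blockIdx.x * blockDim.x term (the "M" in threadIdx.x + blockIdx.x * M) just skips past all the threads that belong to the earlier blocks along that axis.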
But truly, that video, which seemed so important, which I had to get an account for, and which has quizzes (which is what I thought you were basing your quiz on, the one I missed last week), is total garbage. It does not, in and of itself, provide enough information to solve the examples it presents without external resources, and they seem to provide none.
It is a TYPICAL UC production... especially schools like UC Davis. It is educational porn. It chews up your time with a lot of emotional drama, but it does NOTHING for your intellect. It is a huge time waster that sells you the message that they promote (nice to see the NVIDIA CEO of research giving a useless spiel in the middle of the lecture). But it is useless to teach anything from.
I can't stand all the smiling faces and likeable personas in that video. These are not educators. They do not know how to educate.
Ruben
> ________________________________________
> From: Ruben Safir
> Sent: Tuesday, November 1, 2016 8:16 PM
> To: Samir Iabbassen; learn-at-nylxs.com
> Subject: how is it indexing in cuda
>
> This slide says we have block and threads. If we are indexing on both
> then it says
>
> - With M threads/block a unique index for each thread is given by:
>   int index = threadIdx.x + blockIdx.x * M;
>
> I don't understand this at all.
>
> this is slide 39
>
> I have two coordinates. Why am I adding M?
>
> --
> So many immigrant groups have swept through our town
> that Brooklyn, like Atlantis, reaches mythological
> proportions in the mind of the world - RI Safir 1998
> http://www.mrbrklyn.com
>
> DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
> http://www.nylxs.com - Leadership Development in Free Software
> http://www2.mrbrklyn.com/resources - Unpublished Archive
> http://www.coinhangout.com - coins!
> http://www.brooklyn-living.com
>
> Being so tracked is for FARM ANIMALS and and extermination camps,
> but incompatible with living as a free human being. -RI Safir 2013
--
So many immigrant groups have swept through our town
that Brooklyn, like Atlantis, reaches mythological
proportions in the mind of the world - RI Safir 1998
http://www.mrbrklyn.com

DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
http://www.nylxs.com - Leadership Development in Free Software
http://www2.mrbrklyn.com/resources - Unpublished Archive
http://www.coinhangout.com - coins!
http://www.brooklyn-living.com

Being so tracked is for FARM ANIMALS and extermination camps,
but incompatible with living as a free human being. -RI Safir 2013

_______________________________________________
Learn mailing list
Learn-at-nylxs.com
http://lists.mrbrklyn.com/mailman/listinfo/learn