Hello,
Friends, I have a few questions. Here is a summary of RoIAlign:
This is very similar to RoIPooling in Faster R-CNN. For each RoI, RoIPooling first "finds" the features in the feature maps that lie within the RoI's rectangle. Then it max-pools them to create a fixed size vector.
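To make the "find the features inside the rectangle, then max-pool to a fixed size" step concrete, here is a minimal NumPy sketch. It is not the Faster R-CNN implementation: the function name `roi_max_pool`, the single-channel feature map, and the assumption that the RoI is already given in integer feature-map cells (end-exclusive, and at least as large as the output grid) are all illustrative choices.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the cells of a 2-D feature map inside `roi` to (out_h, out_w).

    `roi` is (x1, y1, x2, y2) in integer feature-map cells, end-exclusive.
    Assumes the RoI spans at least out_h rows and out_w columns.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    # Split the rows and columns of the region into roughly equal bins.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            bin_cells = region[row_edges[i]:row_edges[i + 1],
                               col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = bin_cells.max()  # one value per output cell
    return pooled

# Toy 6x6 feature map; the RoI covers the 4x4 block of cells 1..4.
feature_map = np.arange(36).reshape(6, 6)
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 5))
```

Whatever the size of the RoI, the output always has shape `(out_h, out_w)`, which is what lets the later fully connected layers accept it.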
Problem: The coordinates where an RoI starts and ends may be non-integers. E.g. the top left corner might have coordinates (x=2.5, y=4.7). RoIPooling simply rounds these values to the nearest integers (e.g. (x=2, y=5)). But that can create pooled RoIs that are significantly off, as the feature maps with which RoIPooling works have high (total) stride (e.g. 32 pixels in standard ResNets). So being just one cell off can easily lead to being 32 pixels off on the input image. For classification, being some pixels off is usually not that bad. For masks however it can significantly worsen the results, as these have to be pixel-accurate.
In RoIAlign this is compensated by not rounding the coordinates and instead using bilinear interpolation to interpolate between the feature map's cells. Each RoI is pooled by RoIAlign to a fixed-size feature map of size (H, W, F), with H and W usually being 7 or 14. (It can also generate different sizes, e.g. 7x7xF for classification and more accurate 14x14xF for masks.) If H and W are 7, this leads to 49 cells within each plane of the pooled feature maps.
Each cell again is a rectangle -- similar to the RoIs -- and pooled with bilinear interpolation. More exactly, each cell is split up into four sub-cells (top left, top right, bottom right, bottom left). Each of these sub-cells is pooled via bilinear interpolation, leading to four values per cell. The final cell value is then computed using either an average or a maximum over the four sub-values.
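The rounding problem and the bilinear fix can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `bilinear_sample` and `roi_align_cell` are hypothetical, the feature map is a single 2-D plane, and sampling each sub-cell at its centre is my assumption about where the four values are taken.

```python
import numpy as np

STRIDE = 32  # total backbone stride, as in the ResNet example above

def bilinear_sample(feature_map, x, y):
    """Interpolate a 2-D feature map at a fractional (x, y) cell position."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature_map.shape[1] - 1)  # clamp at the right border
    y1 = min(y0 + 1, feature_map.shape[0] - 1)  # clamp at the bottom border
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align_cell(feature_map, x_start, y_start, x_end, y_end, reduce=np.mean):
    """Pool one output cell from four bilinear samples (one per sub-cell).

    Assumption: each sub-cell is sampled at its centre. Pass reduce=np.max
    to take the maximum instead of the average.
    """
    xs = [x_start + (x_end - x_start) * f for f in (0.25, 0.75)]
    ys = [y_start + (y_end - y_start) * f for f in (0.25, 0.75)]
    samples = [bilinear_sample(feature_map, x, y) for y in ys for x in xs]
    return reduce(samples)

# Toy feature map whose value at cell (x, y) is 6*y + x.
feature_map = np.arange(48, dtype=float).reshape(8, 6)

# The corner from the text: (x=2.5, y=4.7) on the feature map.
x_feat, y_feat = 2.5, 4.7
rounded_value = feature_map[5, 2]                       # RoIPooling: cell (x=2, y=5)
aligned_value = bilinear_sample(feature_map, x_feat, y_feat)  # RoIAlign: no rounding

# How far the rounded corner moved, measured in input-image pixels.
shift_x_px = abs(x_feat - 2) * STRIDE
shift_y_px = abs(y_feat - 5) * STRIDE

cell_value = roi_align_cell(feature_map, 2.5, 4.7, 3.5, 5.7)
```

Multiplying the feature-map offset by `STRIDE` shows why a small rounding error matters: half a cell on the feature map is already 16 pixels on the input image, and a full cell is 32.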
1- Doesn't the RPN pass the coordinates of the object candidates to RoIPooling? Then how is RoIPooling rounding coordinates? Doesn't it just max-pool the values inside the RoI? So what does that have to do with coordinates?
2- "as the feature maps with which RoIPooling works have high (total) stride (e.g. 32 pixels in standard ResNets). So being just one cell off can easily lead to being 32 pixels off on the input image."
What does it mean that the stride is 32? I know what stride is, but isn't RoIPooling just supposed to turn each RoI it is given into a fixed-size vector? So what is the stride used for here?