Jiang Jiang's Study Notes

Hunyuan-OCR Inference Explained

Model Architecture Diagram

Hunyuan-OCR architecture diagram. Image source: HunyuanOCR Technical Report

Input Data

{
    "model": "hunyuan-ocr",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image_url": {
                        "url": "https://images.liqucn.com/img/h02/h48/img_localize_2ec0a3d765e45582d4e06844969abb4a_480x854.png"
                    }
                },
                {
                    "type": "text",
                    "text": "检测并识别图片中的文字,将文本坐标格式化输出。"
                }
            ]
        }
    ],
    "stream": false
}
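
A minimal Python sketch of how the request body above could be assembled. The helper name `build_ocr_request` and the `example.com` URL are placeholders for illustration only; the field layout simply mirrors the JSON above. The prompt string is kept verbatim (it asks the model to detect and recognize the text in the image and to output the text coordinates in a formatted way).

```python
import json


def build_ocr_request(image_url: str, prompt: str, stream: bool = False) -> dict:
    """Assemble a chat-style request body with one image part and one text part."""
    return {
        "model": "hunyuan-ocr",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image part first, then the text instruction, as in the notes.
                    {"type": "image", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "stream": stream,
    }


payload = build_ocr_request(
    "https://example.com/sample.png",  # placeholder URL, not from the notes
    "检测并识别图片中的文字,将文本坐标格式化输出。",
)
# ensure_ascii=False keeps the Chinese prompt readable in the serialized body.
body = json.dumps(payload, ensure_ascii=False, indent=4)
print(body)
```

How the body is sent (endpoint path, headers) depends on the serving setup and is not covered here.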

preprocess

chat_template

tokenize

image preprocess

Token processing

XD-RoPE position ID generation

Hunyuan-ViT

PatchEmbed

  1. conv2d
    • kernel_size: patch_size
    • stride: patch_size
    • out_channel: hidden_size=1152
    • (h / patch_size * w / patch_size, channel * patch_size, patch_size)
    • -> (h / patch_size * w / patch_size, channel, patch_size, patch_size)
    • -> (h / patch_size * w / patch_size, out_c)
  2. patch_pos_embed: (1, 138, 138, hidden_size)
    • bilinear interpolation -> (1, grid_h, grid_w, hidden_size)
  3. output: the conv2d output from step 1 plus the interpolated pos_embed from step 2
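
The PatchEmbed steps above can be sketched in plain Python for a single channel. This is a rough illustration, not the author's implementation: `PATCH_SIZE = 16`, the 864x480 input size, and the half-pixel-center interpolation convention are all assumptions, and the 4x4 table is a tiny stand-in for the real 138x138 position table. Because the conv2d uses kernel_size == stride == patch_size, each patch is projected independently, so the output has one row per patch.

```python
def patch_grid(h: int, w: int, patch_size: int) -> tuple[int, int]:
    """Patches along height and width; h and w must be multiples of patch_size."""
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return h // patch_size, w // patch_size


def bilinear_resize(src: list[list[float]], out_h: int, out_w: int) -> list[list[float]]:
    """Resize one channel of a 2D table with bilinear interpolation.

    Assumes half-pixel centers with border clamping; the real model may differ.
    """
    in_h, in_w = len(src), len(src[0])
    out = []
    for i in range(out_h):
        # Map the output cell center back into source coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = src[y0][x0] * (1 - wx) + src[y0][x1] * wx
            bottom = src[y1][x0] * (1 - wx) + src[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def add_pos_embed(patch_feats: list[float], pos_grid: list[list[float]]) -> list[float]:
    """Flatten the resized grid row by row and add it to the patch features."""
    flat = [v for row in pos_grid for v in row]
    if len(flat) != len(patch_feats):
        raise ValueError("one position value is needed per patch")
    return [f + p for f, p in zip(patch_feats, flat)]


PATCH_SIZE = 16  # assumed value, not stated in the notes
grid_h, grid_w = patch_grid(864, 480, PATCH_SIZE)  # hypothetical resized image
print(grid_h, grid_w, grid_h * grid_w)  # grid size and number of conv2d output rows

# Tiny stand-in for the learned position table, resized to a 2x2 patch grid.
table = [[float(4 * r + c) for c in range(4)] for r in range(4)]
resized = bilinear_resize(table, 2, 2)
features = add_pos_embed([0.0, 0.0, 0.0, 0.0], resized)
print(features)
```

In the real model the same resize is applied to every one of the hidden_size channels, and the sum has shape (grid_h * grid_w, hidden_size).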

VisionLayers

PatchMerger: Adaptive MLP Connector

  1. before_rms
  2. proj
    • conv2d:
      • kernel_size: merge_size
      • stride: merge_size
      • in_c: hidden_size
      • out_c: hidden_size*2
    • gelu
    • conv2d:
      • kernel_size: 1
      • in_c: hidden_size*2
      • out_c: hidden_size*4
  3. cat image_newline:(hidden_size*4)
  4. mlp
    • in_c: hidden_size*4
    • out_c: text_hidden_size: 1024
  5. cat image_begin and image_end
  6. after_rms
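
To make the PatchMerger steps above concrete, here is a small shape walkthrough in Python. It only tracks shapes and token counts, not real tensors. Two things are assumptions rather than facts from the notes: `merge_size = 2`, and that one image_newline token is appended at the end of every row of merged patches. The name `patch_merger_shapes` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MergerShapes:
    after_proj: tuple[int, int, int]     # (rows, cols, channels) after both conv2d layers
    after_newline: tuple[int, int, int]  # one extra column for image_newline
    num_tokens: int                      # tokens handed to the language model
    out_dim: int                         # channel size after the mlp


def patch_merger_shapes(
    grid_h: int,
    grid_w: int,
    hidden_size: int = 1152,
    text_hidden_size: int = 1024,
    merge_size: int = 2,  # assumed value, not stated in the notes
) -> MergerShapes:
    if grid_h % merge_size or grid_w % merge_size:
        raise ValueError("grid must be divisible by merge_size")
    # conv2d with kernel_size == stride == merge_size shrinks each side by merge_size.
    rows, cols = grid_h // merge_size, grid_w // merge_size
    # First conv2d: hidden_size -> hidden_size*2; 1x1 conv2d: -> hidden_size*4.
    channels = hidden_size * 4
    after_proj = (rows, cols, channels)
    # Assumption: image_newline (hidden_size*4) is appended once per row.
    after_newline = (rows, cols + 1, channels)
    # The mlp maps hidden_size*4 -> text_hidden_size, then image_begin and
    # image_end add two more tokens.
    num_tokens = rows * (cols + 1) + 2
    return MergerShapes(after_proj, after_newline, num_tokens, text_hidden_size)


shapes = patch_merger_shapes(54, 30)  # hypothetical 54x30 patch grid
print(shapes)
```

Under these assumptions a 54x30 patch grid becomes 27 rows of 15 merged patches, and the image contributes 27 * 16 + 2 tokens of width 1024 to the text model.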

Hunyuan 0.5B

Rust inference code

https://github.com/jhqxxx/aha/tree/main/src/models/hunyuan_ocr