Defaulted container "vss" out of: vss, check-milvus-up (init), check-neo4j-up (init), check-llm-up (init)
OPENAI_API_KEY_NAME is already set to: VSS_OPENAI_API_KEY
NVIDIA_API_KEY_NAME is already set to: VSS_NVIDIA_API_KEY
NGC_API_KEY_NAME is already set to: VSS_NGC_API_KEY
/var/secrets/secrets.json file does not exist
GPU has 5 decode engines
Total GPU memory is 81920 MiB per GPU
Auto-selecting VLM Batch Size to 128
release IGNORE
Using vila-1.5
Starting VIA server in release mode
2025-02-13 08:08:51,964 INFO Initializing VIA Stream Handler
2025-02-13 08:08:51,965 INFO Initializing VLM pipeline
2025-02-13 08:08:51,969 INFO Using model cached at /tmp/via-ngc-model-cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita
2025-02-13 08:08:51,969 INFO num_vlm_procs set to 2
[2025-02-13 08:08:58,708] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-13 08:08:58,730] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-13 08:08:58,737] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-13 08:08:58,738] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-13 08:08:58,777] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-13 08:08:59,026] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 128
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 128
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2984
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2984
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2983 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 128
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 128
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2984
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2984
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2983 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 18234 MiB
[TensorRT-LLM][INFO] Loaded engine size: 18234 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1185.54 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 18218 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 7.02 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 121.23 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.15 GiB, available: 50.16 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1370
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 47
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 20.07 GiB for max tokens in paged KV cache (87680).
Loading checkpoint shards: 47%|████▋ | 7/15 [00:12<00:13, 1.74s/it]VILA TRT model load execution time = 18.595 sec
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1185.54 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 18218 (MiB)
Loading checkpoint shards: 53%|█████▎ | 8/15 [00:13<00:11, 1.66s/it]TRT generate execution time = 1.047 sec
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 7.02 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 121.23 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.15 GiB, available: 50.16 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1370
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 47
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 20.07 GiB for max tokens in paged KV cache (87680).
VILA TRT model load execution time = 20.600 sec
Loading checkpoint shards: 67%|██████▋ | 10/15 [00:16<00:07, 1.59s/it]TRT generate execution time = 1.070 sec
Loading checkpoint shards: 100%|██████████| 15/15 [00:22<00:00, 1.48s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:22<00:00, 1.47s/it]
Loading checkpoint shards: 73%|███████▎ | 11/15 [00:18<00:06, 1.51s/it]VILA decoder Model load execution time = 24.821 sec
VILA decoder Model load execution time = 25.124 sec
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Loading checkpoint shards: 87%|████████▋ | 13/15 [00:20<00:02, 1.36s/it]Decode execution time = 383.225 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 360.287 millisec
Failed to query video capabilities: Invalid argument
Loading checkpoint shards: 100%|██████████| 15/15 [00:21<00:00, 1.46s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:22<00:00, 1.47s/it]
Decode execution time = 330.249 millisec
Failed to query video capabilities: Invalid argument
VILA Embeddings TRT Model load execution time = 26.811 sec
Decode execution time = 354.177 millisec
Failed to query video capabilities: Invalid argument
VILA Embeddings generation execution time = 99.295 millisec
VILA Embeddings TRT Model load execution time = 26.994 sec
VILA Embeddings generation execution time = 78.027 millisec
Decode execution time = 322.825 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 337.013 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 241.977 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 340.462 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 323.740 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 224.342 millisec
Failed to query video capabilities: Invalid argument
Decode execution time = 359.439 millisec
Decode execution time = 210.807 millisec
2025-02-13 08:09:31,984 INFO Initialized VLM pipeline
2025-02-13 08:09:32,092 INFO Using meta/llama-3.1-70b-instruct as the summarization llm
2025-02-13 08:09:32,209 INFO Using meta/llama-3.1-70b-instruct as the cypher llm
2025-02-13 08:09:32,311 INFO Setting up GraphRAG
2025-02-13 08:09:32,524 INFO Initialized VIA Stream Handler
INFO: Started server process [96]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:37308 - "GET /health/ready HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
INFO: 10.169.20.130:37186 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:37188 - "GET /health/ready HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
INFO: 10.169.20.130:55176 - "GET /health/ready HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
2025-02-13 08:09:49 | ERROR | stderr | INFO: Started server process [8365]
2025-02-13 08:09:49 | ERROR | stderr | INFO: Waiting for application startup.
2025-02-13 08:09:49 | ERROR | stderr | INFO: Application startup complete.
2025-02-13 08:09:49 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
2025-02-13 08:09:49 | INFO | stdout | INFO: 127.0.0.1:47104 - "GET / HTTP/1.1" 200 OK
***********************************************************
VIA Server loaded
Backend is running at http://0.0.0.0:8000
Frontend is running at http://0.0.0.0:9000
Press ctrl+C to stop
***********************************************************
INFO: 10.169.20.130:55178 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:55188 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:40216 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:40232 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:40234 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:44064 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:44080 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:44092 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:47294 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:47296 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:47304 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:36636 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:36644 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:36652 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:39922 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:39938 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:39928 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:41864 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:41880 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:41882 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:59334 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:59348 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:59354 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:41224 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:41228 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:41230 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:59176 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:59180 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:59178 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:55638 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:55642 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:55654 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:42432 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:42440 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:42446 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:60402 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:60426 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:60412 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:58592 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:58602 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:58606 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:60078 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:60084 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:60086 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:53320 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:53326 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:53330 - "GET /health/live HTTP/1.1" 200 OK
INFO: 10.169.20.130:34610 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:34612 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.169.20.130:34614 - "GET /health/live HTTP/1.1" 200 OK