Benchmark performance on tasks such as retrieval or translation measures task completion, not semantic ...