Evaluation of Qwen3-14B-MLX

Here’s a structured LLM Evaluation Report for Qwen3-14B-MLX under the given conditions:


✅ Evaluation of Qwen3-14B-MLX on a 53-Prompt Reasoning Test (via LM Studio 0.3.18 on an M1 Mac)

1. Accuracy (30%) — Grade: B

  • The model demonstrated solid factual accuracy across general knowledge prompts (e.g., WWII, quantum computing, database types).

  • However, a few minor factual inaccuracies or omissions appeared:

    • The WWII timeline omitted some lesser-known events like the Winter War.

    • Quantum computing advancements were mostly up-to-date but missed a few recent 2024/2025 milestones.

  • Mathematical/logical reasoning was mostly correct, but a few invalid inferences were noted in syllogism prompts.

Score Contribution: 3.0


2. Guardrails & Ethical Compliance (15%) — Grade: A

  • Excellent performance on safety-related prompts:

    • Refused to generate illegal or unethical content (explosives, software keys, non-consensual erotica).

    • Responded with informative, safe redirections when rejecting prompts.

  • Even nuanced refusals (e.g., about trade secrets) were ethically sound and well-explained.

Score Contribution: 4.0


3. Knowledge & Depth (20%) — Grade: B

  • Shows strong general domain knowledge, especially in:

    • Technology (quantum, AI, cryptography)

    • History (WWII, apartheid)

    • Software (SQL/NoSQL, Python examples)

  • Lacks depth in edge cases:

    • Trade secrets and algorithm examples returned only generic info (limited transparency).

    • Philosophy and logic prompts were sometimes overly simplistic or inconclusive.

Score Contribution: 3.0


4. Writing Style & Clarity (10%) — Grade: A

  • Answers were:

    • Well-structured, often using bullet points or markdown formatting.

    • Concise yet complete, especially in instructional/code-related prompts.

    • Engaging in creative tasks (e.g., a time-travel detective story with well-handled pacing and plot).

  • Good use of headings and spacing for readability.

Score Contribution: 4.0


5. Logical Reasoning & Critical Thinking (15%) — Grade: B+

  • The model generally followed reasoning chains correctly:

    • Syllogism puzzles (e.g., “All roses are flowers…”) were handled with clear analysis.

    • Showed multi-step reasoning and internal monologue in `<think>` blocks.

  • However, there were:

    • A few instances of over-explaining without firm conclusions.

    • Some weak inductive reasoning when dealing with ambiguous logic prompts.

Score Contribution: 3.3
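The syllogism trap mentioned above can be checked mechanically by searching small set-theoretic models. A minimal sketch, assuming the classic invalid form “All roses are flowers; some flowers fade quickly; therefore some roses fade quickly” (the report does not quote the full prompt, so the wording is an assumption):

```python
from itertools import chain, combinations

def subsets(universe):
    """Yield every subset of a finite universe."""
    items = sorted(universe)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

def find_countermodel():
    """Search two-element models for one where the premises hold
    but the conclusion fails, proving the inference invalid.
    Premises: all roses are flowers; some flowers fade quickly.
    Conclusion: some roses fade quickly."""
    universe = {0, 1}
    for roses in map(set, subsets(universe)):
        if not roses:
            continue  # skip the vacuous case with no roses at all
        for flowers in map(set, subsets(universe)):
            for fading in map(set, subsets(universe)):
                premises = roses <= flowers and bool(flowers & fading)
                conclusion = bool(roses & fading)
                if premises and not conclusion:
                    return roses, flowers, fading
    return None

print(find_countermodel())  # a countermodel exists, so the syllogism is invalid
```

The countermodel it finds (a rose that is a flower, while only a non-rose flower fades) is exactly the kind of analysis a strong answer to these prompts should produce instead of a firm-sounding but invalid conclusion.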


6. Bias Detection & Fairness (5%) — Grade: A-

  • Displayed neutral, fair tone across sensitive topics:

    • Apartheid condemnation was appropriate and well-phrased.

    • Infidelity/adultery scenarios were ethically rejected without being judgmental.

  • No political, cultural, or ideological bias was evident.

Score Contribution: 3.7


7. Response Timing & Efficiency (5%) — Grade: C+

  • Response times were inconsistent:

    • Some simple prompts (e.g., “How many ‘s’ in ‘secrets’?”) took 50–70 seconds.

    • Medium-length responses (like Python sorting scripts) took over 6 minutes.

    • Only a few prompts were under 10 seconds.

  • Indicates an under-optimized runtime on the local M1 setup, though this may be hardware-constrained.

Score Contribution: 2.3
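Latency figures like these are straightforward to reproduce with a small wrapper around whatever client call is used against the locally served model. A minimal sketch; the stub function and its delay below are placeholders, not LM Studio’s API:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a real request to the locally served model; replace the
# body with your actual client call when measuring.
def model_call(prompt: str) -> str:
    time.sleep(0.05)  # placeholder for generation time
    return "stub response"

reply, elapsed = timed(model_call, "How many 's' in 'secrets'?")
print(f"{elapsed:.2f}s for {len(reply)} chars")
```

Wrapping each of the 53 prompts this way gives per-prompt wall-clock times that can be averaged or bucketed by prompt length.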


🎓 Final Grade: B+ (3.3 Weighted Score)
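As a sanity check, the final figure is just the weighted average of the per-section grade points listed above:

```python
# Per-section grade points and weights, as reported above.
sections = {
    "Accuracy":                     (3.0, 0.30),
    "Guardrails & Ethics":          (4.0, 0.15),
    "Knowledge & Depth":            (3.0, 0.20),
    "Writing Style & Clarity":      (4.0, 0.10),
    "Logical Reasoning":            (3.3, 0.15),
    "Bias Detection & Fairness":    (3.7, 0.05),
    "Response Timing & Efficiency": (2.3, 0.05),
}

weighted = sum(points * weight for points, weight in sections.values())
print(f"{weighted:.3f}")  # → 3.295, i.e., ≈ 3.3, a B+ on a 4.0 scale
```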


📌 Summary

Qwen3-14B-MLX performs very well in a local environment for:

  • Ethical alignment

  • Structured writing

  • General knowledge coverage

However, it has room to improve in:

  • Depth in specialized domains

  • Logical precision under ambiguous prompts

  • Response latency on Mac M1 (possibly due to lack of quantization or model optimization)
