The Unreasonable Effectiveness of Eccentric Automatic Prompts

Large Language Models (LLMs) have demonstrated remarkable problem-solving and basic mathematics abilities. However, their efficacy is highly contingent on the formulation of the prompt. This study endeavors to quantify the influence of incorporating "positive thinking" into the system message of the prompt, then compare that to systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models with parameters ranging from 7 to 70 billion on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, the inclusion of "positive thinking" prompts positively affected model performance. Notably, however, Llama2-70B exhibited an exception when not utilizing Chain of Thought, as the optimal system message was found to be none at all. Given the combinatorial complexity, and thus computation time, of experimenting with hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that employing an automated prompt optimizer emerges as the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, our findings reveal that the highest-scoring, automatically-optimized prompt exhibits a degree of peculiarity far beyond expectations.
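The hand-tuning baseline described above — scoring every combination of system message snippets, with and without a Chain of Thought suffix — can be sketched as an exhaustive search. The snippet pools, helper names, and stub model below are illustrative assumptions, not the paper's actual snippets or evaluation harness:

```python
import itertools

# Hypothetical snippet pools (illustrative; the paper tests 60 combinations).
OPENERS = ["", "You are an expert mathematician.", "This will be fun!"]
TASKS = ["", "Solve the following math problem."]
COT_SUFFIXES = ["", "Let's think step by step."]  # Chain of Thought off/on


def build_system_message(opener, task, cot):
    """Join the non-empty snippets into a single system message."""
    return " ".join(part for part in (opener, task, cot) if part)


def score(system_message, model, dataset):
    """Fraction of (question, answer) pairs the model gets right under this prompt."""
    correct = sum(model(system_message, q) == a for q, a in dataset)
    return correct / len(dataset)


def best_prompt(model, dataset):
    """Exhaustively search all snippet combinations for the highest-scoring prompt."""
    combos = itertools.product(OPENERS, TASKS, COT_SUFFIXES)
    return max(
        (build_system_message(*c) for c in combos),
        key=lambda msg: score(msg, model, dataset),
    )
```

Note that the empty combination is included, so "no system message at all" competes on equal footing, which is how a result like Llama2-70B's (best prompt: none) can surface. The cost of this search grows multiplicatively with each snippet pool, which is the combinatorial complexity that motivates automated prompt optimization.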

Further reading