Reinforcement learning-based routing (RLR) in wireless mesh networks has recently attracted the attention of several research groups. Several recent studies have demonstrated that RLR achieves higher network performance than traditional routing protocols. In most RLR protocols, nodes use an ε-greedy policy to select data transmission routes and update their Q-value tables. Under this policy, the best route is chosen with high probability, corresponding to the exploitation phase; the remaining routes are chosen with low probability, corresponding to the exploration phase. A challenge with the ε-greedy policy in RLR protocols is that data packets transmitted in the exploration phase have a high drop probability or a large end-to-end delay because they traverse long routes. In this paper, we propose an improved RLR protocol for wireless mesh networks to further improve performance. Our approach modifies the ε-greedy policy in RLR by generating additional control packets for transmission in the exploration phase, so that all data packets are transmitted during the exploitation phase. Simulation results using OMNeT++ show that the proposed algorithm increases the packet delivery ratio by an average of 0.2 to 0.6% and reduces latency by an average of 0.20 to 0.23 ms compared to the basic reinforcement learning-based routing algorithm.
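The idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the route identifiers, the `forward_packet` helper, and the dictionary-based Q-table are all hypothetical, and the sketch abstracts away Q-value updates and actual packet transmission. It shows only the selection logic: under the standard ε-greedy policy a data packet may be sent on an explored (possibly long) route, while in the proposed variant exploration is delegated to a separate control (probe) packet and the data packet always takes the best-known route.

```python
import random

def select_route(q_values, epsilon, rng=random):
    """epsilon-greedy selection over a dict {route_id: Q-value}.

    With probability 1 - epsilon, exploit: pick the route with the
    highest Q-value. With probability epsilon, explore: pick one of
    the remaining routes uniformly at random.
    Returns (route_id, explored_flag).
    """
    best = max(q_values, key=q_values.get)
    if rng.random() < epsilon and len(q_values) > 1:
        candidates = [r for r in q_values if r != best]
        return rng.choice(candidates), True
    return best, False

def forward_packet(q_values, epsilon, rng=random):
    """Sketch of the proposed scheme: data packets always follow the
    exploited (best) route; when the policy chooses to explore, a
    control packet probes the explored route instead of a data packet."""
    route, explored = select_route(q_values, epsilon, rng)
    if explored:
        # Probe the explored route with a control packet so its
        # Q-value can be refreshed; the data packet still takes
        # the current best route.
        best = max(q_values, key=q_values.get)
        return {"data_route": best, "probe_route": route}
    return {"data_route": route, "probe_route": None}
```

For example, with `epsilon = 0` every call returns the best route and no probe, whereas with a nonzero `epsilon` an occasional call additionally schedules a probe on one of the alternative routes, which is the behavior the proposed modification exploits to keep data traffic off long exploratory paths.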